Loading…
10-11 June
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon China 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC+8:00)To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis. 
Wednesday June 11, 2025 09:57 - 10:07 HKT
With the growing demand for heterogeneous computing power, Chinese users are gradually adopting domestic GPUs, especially for inference. vLLM, the most popular open-source inference project, has drawn widespread attention but does not support domestic chips.Chinese inference engines are still developing in functionality, performance, and ecosystem. In this session, we’ll introduce how to adapt vLLM to support domestic GPUs,enabling acceleration features like PageAttention, Continuous Batching, and Chunked Prefill. We’ll also cover performance bottleneck analysis and chip operator development to maximize hardware potential.
Additionally, Kubernetes has become the standard for container orchestration and is the preferred platform for inference services. We’ll show how to deploy the adapted vLLM engine on Kubernetes using the open-source llmaz project with a few lines of code, and explore how llmaz handles heterogeneous GPU scheduling and our practices for monitoring and elastic scaling.
Speakers
avatar for Haiwen Zhang

Haiwen Zhang

Senior Software Engineer, China Mobile (Suzhou) Software Technology Co., Ltd.
The author has rich experience in cloud-native and AI inference development, currently works at China Mobile, focusing on the research and development of cloud-native and AI inference related products. He shared experiences of service mesh at some technical conferences such as the... Read More →
avatar for Kante Yin

Kante Yin

Software Engineer, DaoCloud
Kante is a senior software engineer and an open source enthusiast from DaoCloud, his work is mostly around scheduling, resource management and LLM inference. He actively contributes to upstream Kubernetes as SIG-Scheduling Maintainer and helps in incubating several projects like Kueue... Read More →
Wednesday June 11, 2025 09:57 - 10:07 HKT
Level 19 | Crystal Court I+II
  Keynote Sessions, AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link