Name: ⚡ Lightning Talk: Mastering Prefill-Decode-Disaggregated Architecture: Solutions and Best Practices in Alibaba Cloud - Jing Gu & Yang Che, Alibaba Cloud
Start: 2025-06-10T16:47:00+0800
End: 2025-06-10T17:52:00+0800

10-11 June

Tuesday June 10, 2025 16:47 - 17:52 HKT

Level 16 | Grand Ballroom I

Disaggregating the prefill and decode phases of LLM inference has garnered significant attention in the industry because it can improve performance. Several solutions have been developed, including Mooncake, TetriInfer, Splitwise, DistServe, and RTP-LLM. However, deploying disaggregated LLM inference at scale on Kubernetes while evaluating its performance and cost benefits presents numerous challenges.
In this talk, we will introduce a solution that combines a LeaderWorkerSet as the workload, an Ingress Controller, and a node discovery service. It deploys prefill-decode (PD) disaggregated inference on Kubernetes and supports multiple LLM inference engines, such as Mooncake and RTP-LLM, with zero intrusion. Furthermore, we will discuss how to improve load balancing with Envoy and ORCA based on KVCache and metrics, and how to recommend optimal ratios for the PD phases. Finally, we will cover essential features for production deployment, such as high availability, elastic scaling, canary releases, and observability.
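
As a rough illustration of what load balancing based on KVCache and metrics can mean, the sketch below picks a backend from per-instance load reports. It is not the speakers' implementation: the `BackendLoad` fields, the scoring formula, the `queue_weight` and `max_kv_utilization` values, and the backend names are all hypothetical, chosen only to show the idea of preferring instances with free KV cache and short queues.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BackendLoad:
    """Hypothetical per-instance load report (field names are assumptions)."""
    name: str
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    queued_requests: int         # requests waiting on this instance


def score(backend: BackendLoad, queue_weight: float = 0.05) -> float:
    """Lower is better: KV cache pressure plus a penalty per queued request."""
    return backend.kv_cache_utilization + queue_weight * backend.queued_requests


def pick_backend(
    backends: Sequence[BackendLoad],
    max_kv_utilization: float = 0.95,
) -> BackendLoad:
    """Choose the lowest-scoring backend, skipping nearly full KV caches.

    If every backend is above the threshold, fall back to the full list
    so that a request is still routed somewhere.
    """
    if not backends:
        raise ValueError("no backends available")
    eligible = [b for b in backends if b.kv_cache_utilization < max_kv_utilization]
    candidates = eligible or list(backends)
    return min(candidates, key=score)


if __name__ == "__main__":
    reports = [
        BackendLoad("decode-0", 0.90, 1),
        BackendLoad("decode-1", 0.40, 3),
        BackendLoad("decode-2", 0.97, 0),
    ]
    print(pick_backend(reports).name)
```

In a real deployment the reports would come from the inference engines themselves, and the weights would need tuning against measured latency rather than being fixed constants.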

Speakers

Yang Che

Senior Software Engineer, Alibaba Cloud

Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team and focuses on Kubernetes and container-related product development. Yang also works on building elastic machine learning platforms on those technologies. He is an active contributor...

Jing Gu

Software Engineer, Alibaba Cloud

Jing Gu is a senior engineer at Alibaba Cloud. She works on Alibaba Cloud Container Service for Kubernetes, focusing on serving large language models (LLMs) within Kubernetes and optimizing LLM inference processes.

Mastering Prefill Decode Disaggregated Architecture v3 pdf

⚡ Lightning Talks, AI + ML

Content Experience Level Intermediate
Presentation Language Chinese

KubeCon + CloudNativeCon China 2025
