10-11 June

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon China 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC+8:00). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.
Venue: Level 19 | Crystal Court I
Tuesday, June 10
 

11:00 HKT

AI Model Distribution Challenges and Best Practices - Wenbo Qi & Xiaoya Xia, Ant Group; Eryu Guan, Aliyun; Wenpeng Li, Alibaba Cloud; Han Jiang, Kuaishou
Tuesday June 10, 2025 11:00 - 11:30 HKT
As the demand for scalable AI/ML grows, efficiently distributing AI models in cloud-native infrastructure has become a pivotal challenge for enterprises. The panel dives into the technical and operational strategies for deploying models at scale, from optimizing model storage and transfer to ensuring consistency across clusters and regions. Experts from different companies and CNCF projects will debate critical questions like: How can Kubernetes-native workflows automate and accelerate model distribution while minimizing latency and bandwidth costs? How can huge models, hundreds of GBs or even TBs in size, be distributed efficiently? What challenges are posed by distributed inference and the prefilling-decoding architecture? How are models updated in the reinforcement-learning post-training paradigm? What role do standards like OCI artifacts or specialized registries play in streamlining versioned model delivery?
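Several of these questions come down to moving only the bytes that changed. A toy sketch of the content-addressed chunking idea used by P2P distributors such as Dragonfly; the chunk size and helper names are illustrative, not any project's API:

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny for illustration, real systems use MB-scale chunks

def chunk_hashes(blob: bytes) -> list[str]:
    """Split a model blob into fixed-size chunks and hash each one."""
    return [
        hashlib.sha256(blob[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(blob), CHUNK_SIZE)
    ]

def missing_chunks(manifest: list[str], local_store: dict[str, bytes]) -> list[str]:
    """Only chunks absent from the local cache need to cross the network."""
    return [h for h in manifest if h not in local_store]

# A peer that already holds v1 of a model only fetches the chunks that changed in v2.
v1 = b"AAAABBBBCCCC"
v2 = b"AAAABBBBDDDD"  # last chunk differs
store = {h: v1[i * 4:(i + 1) * 4] for i, h in enumerate(chunk_hashes(v1))}
to_fetch = missing_chunks(chunk_hashes(v2), store)
print(len(to_fetch))  # 1: only the changed chunk is transferred
```

For a hundreds-of-GB model where only a few layers were fine-tuned, this is the difference between re-shipping the whole artifact and shipping a few chunks.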
Speakers

Han Jiang

Software Engineer, Kuaishou
Software engineer at Kuaishou who previously worked on the Kubernetes ecosystem and container-related technologies. Currently, he is focused on optimizing the inference performance of large language models.

Xiaoya Xia

Open Source Analyst, Ant Group
Xiaoya Xia is a member of the Ant Group OSPO, where she focuses on catalyzing open source success through data-driven insights. Before joining Ant Group, Xiaoya was a PhD candidate at East China Normal University (ECNU), where she concentrated on research into open source ecosystem sustain...

Wenbo Qi

Software Engineer, Ant Group
Wenbo Qi is a software engineer at Ant Group working on Dragonfly, where he is a maintainer. He hopes to make positive contributions to open source software and believes that fear springs from ignorance.

Eryu Guan

Software Engineer, Aliyun

Wenpeng Li

Alibaba Cloud
Tuesday June 10, 2025 11:00 - 11:30 HKT
Level 19 | Crystal Court I
  AI + ML

11:45 HKT

Defining a Specification for AI/ML Artifacts - Fog Dong, BentoML; Gorkem Ercan, Jozu; Peng Tao & Chlins Zhang, Ant Group; Xudong Wang, Paypal
Tuesday June 10, 2025 11:45 - 12:15 HKT
AI has become a prominent force in the cloud native ecosystem, and adoption in this emerging field continues to be massive. As frameworks and approaches are introduced, a pattern has emerged that threatens the ability to manage at scale: each implementation introduces its own format, runtime, and ways of working, fragmenting the ecosystem. On the other hand, open standards are the backbone of cohesive and scalable ecosystems.

This panel discussion seeks to explore the importance of defining standards within the CNCF ecosystem, particularly focusing on AI/ML artifacts. Beyond the advantages a standard offers in facilitating integration with existing cloud native tools, this conversation will delve into how standards can serve as a foundation for innovation. Join us to understand how standardization, combined with innovative approaches, can advance the cloud native AI landscape.
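As a concrete illustration of what such a standard might pin down, here is a hypothetical OCI-style manifest for a versioned model artifact, sketched as a Python dict. The media types and annotation keys are invented for this sketch, not an adopted specification:

```python
import json

def model_manifest(name: str, version: str, weights_digest: str, fmt: str) -> dict:
    """Build a hypothetical OCI-style manifest for a model artifact.

    The artifactType, mediaType, and annotation keys below are placeholders
    illustrating what a spec could standardize: a declared artifact type,
    content-addressed layers, and versioning metadata.
    """
    return {
        "schemaVersion": 2,
        "artifactType": "application/vnd.example.model.manifest.v1+json",
        "annotations": {
            "org.example.model.name": name,
            "org.example.model.version": version,
        },
        "layers": [
            {
                "mediaType": f"application/vnd.example.model.weights.v1.{fmt}",
                "digest": weights_digest,  # content-addressed: same bytes, same digest
            }
        ],
    }

m = model_manifest("llama3-8b", "1.0.0", "sha256:abc123", "safetensors")
print(json.dumps(m, indent=2))
```

The point of standardizing such fields is that registries, caches, and deployment tools can all agree on how to identify, version, and verify a model without engine-specific logic.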
Speakers

Chlins Zhang

Software Engineer, Ant Group
Chenyu Zhang is a software engineer at Ant Group, currently responsible for the development and maintenance of the Harbor project. He also has experience with DevOps and cloud native technology stacks.

Peng Tao

Staff Engineer, Ant Group
Kata Containers architecture committee member, Nydus maintainer, and Linux kernel developer.

Fog Dong

Senior Software Engineer, BentoML
Fog Dong (Tianxin Dong) is a senior engineer at BentoML, a core maintainer of KubeVela, and a CNCF Ambassador. She is dedicated to building open source communities and works tirelessly to advance open source projects, particularly in the cloud native DevOps space. Currently, at BentoML she...

Gorkem Ercan

CTO, Jozu
Gorkem Ercan is a co-founder and CTO of Jozu. Gorkem has experience working with and leading teams across technologies ranging from IDEs to mobile phones and CI/CD systems. He is an avid contributor to and supporter of open source and previously served at the Eclipse...
Tuesday June 10, 2025 11:45 - 12:15 HKT
Level 19 | Crystal Court I
  AI + ML

13:45 HKT

Fast and Furious: Practice in Horizon Robotics on Large-scale End-to-end Model Training - Chen Yangxue, Horizon Robotics & Zhihao Xu, Alibaba Cloud
Tuesday June 10, 2025 13:45 - 14:15 HKT
End-to-end large model training is crucial for advancing autonomous driving technology. Horizon Robotics leads in this field by leveraging deep learning algorithms and chip design. They efficiently train and deploy advanced perception models like Sparse4D using cloud-native technologies.
Training these models poses challenges: managing massive video data and numerous small files, ensuring high-performance training with over 2,000 GPUs on RDMA, and quickly identifying and diagnosing the various failures that arise in large-scale training.
This session covers how Horizon Robotics manages large-scale training on Kubernetes. It highlights the role of distributed data caching, network topology awareness, and job affinity scheduling in optimizing a 2000 GPU training job. We'll also discuss strategies for restoring interrupted training jobs through backup machine replacement to enhance task resilience. Furthermore, experiences with CNCF projects like Volcano, Fluid, and NPD will be shared.
Speakers

Zhihao Xu

Software Engineer, Alibaba Cloud
Zhihao Xu is currently a software engineer at Alibaba Cloud focusing on infrastructure for AI model training and large-scale model inference. Also, he is now a Maintainer of the CNCF sandbox project Fluid, which is designed for data orchestration for data-intensive applications running...

Chen Yangxue

Software Engineer, Horizon Robotics
I'm Chen Yangxue, a software engineer at Horizon Robotics. With years of cloud-native experience, I'm building a ten-thousand-GPU training platform on a hybrid cloud setup. I've used tools like Kubernetes, Volcano, etc., to solve tough technical problems. I know how to optimize...
Tuesday June 10, 2025 13:45 - 14:15 HKT
Level 19 | Crystal Court I
  AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese

14:30 HKT

More Than Model Sharding: LWS & Distributed Inference - Peter Pan & Nicole Li, DaoCloud
Tuesday June 10, 2025 14:30 - 15:00 HKT
Large LLMs like Llama3.1-405B or DeepSeek-V3 (671B) require distributed inference across multiple nodes, for example vLLM with a Ray backend.
However, it takes more than model sharding with tensor parallelism: native Kubernetes treats the pods of such a workload as unrelated, so challenges arise:
- standalone StatefulSets without coordination
- the need for gang scheduling
- uncontrolled startup order between leader and workers, causing boot lag
- the need for HPA to scale the group as a whole rather than each StatefulSet, so the Ray head and workers scale together
- stable index and rank assignment
- topology-aware grouping
- failure recovery that vLLM/PyTorch alone do not handle well, so one pod/GPU failure does not disrupt the whole inference group

----
So LWS (LeaderWorkerSet, github.com/kubernetes-sigs/lws) is designed to address them:
- coordinating resources through leader-worker groups
- improving performance through co-location
- integrating scaling with HPA for the whole LWS group
- an all-or-nothing restart policy for group-level fault tolerance
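A minimal LeaderWorkerSet manifest, sketched here as a Python dict, shows how the group structure addresses several of the points above. Field names follow the lws v1 API as published by the project, but verify against the project docs before use:

```python
# One LeaderWorkerSet describing two inference groups, each 1 leader + 3 workers.
# The container specs are illustrative placeholders.
lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-405b"},
    "spec": {
        "replicas": 2,  # two independent inference groups; HPA scales this number
        "leaderWorkerTemplate": {
            "size": 4,  # 1 leader + 3 workers per group, with stable indices
            "restartPolicy": "RecreateGroupOnPodRestart",  # all-or-nothing restart
            "leaderTemplate": {
                "spec": {"containers": [{"name": "ray-head", "image": "vllm"}]}
            },
            "workerTemplate": {
                "spec": {"containers": [{"name": "ray-worker", "image": "vllm"}]}
            },
        },
    },
}

total_pods = lws["spec"]["replicas"] * lws["spec"]["leaderWorkerTemplate"]["size"]
print(total_pods)  # 8
```

Because the unit of replication is the whole group, autoscaling adds or removes complete leader-plus-workers sets, and a pod failure triggers a group-wide restart rather than leaving a half-broken shard serving traffic.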
Speakers

Nicole Li

Cloud Native Developer, DaoCloud
Cloud Native Developer, Service Mesh & Istio Contributor, AI Newbie

Peter Pan

R&D Engineering VP, DaoCloud
- DaoCloud Software Engineering VP
- Regular KubeCon Program Committee member: 2023 EU, 2024 HK, 2024 India, 2025 EU
- Regular KubeCon speaker: 2023 SH, 2024 EU, 2024 HK
- Maintainer of the CNCF projects cloudtty, kubean, and hwameistor
- CNCF WG-AI (AI Working Group) member and CNAI white-paper...
Tuesday June 10, 2025 14:30 - 15:00 HKT
Level 19 | Crystal Court I
  AI + ML

15:30 HKT

Smart GPU Management: Dynamic Pooling, Sharing, and Scheduling for AI Workloads in Kubernetes - Wei Chen, China Unicom Cloud Data & Mengxuan Li, Dynamia
Tuesday June 10, 2025 15:30 - 16:00 HKT
With the rapid growth of AI applications, optimal GPU utilization is essential, particularly in GPU sharing and job scheduling. Balancing performance, flexibility, and isolation is as challenging as the “Impossible Trinity”. Technologies such as vCUDA, MPS, and MIG are promising attempts, but each has its pros and cons. Managing clusters with multiple sharing techniques adds complexity due to differing resource names and configurations.
In this talk, we will demonstrate how to combine these methods easily. Users specify the memory and core count without managing GPU types or sharing methods. Based on user preferences and GPU resources, the best node and method will be selected. Requests are automatically translated into optimal profiles, and GPUs are dynamically partitioned.
This approach streamlines GPU management, enhances utilization, and improves scheduling. By integrating Volcano and HAMi, the solution strengthens GPU pooling and scheduling, optimizing AI workload management.
Speakers

Mengxuan Li

Software Engineer, Dynamia Inc
A member of the Volcano community responsible for developing the GPU virtualization mechanism in Volcano; it has been merged into Volcano's master branch and will be released in v1.8. Speaker at OpenAtom Global Open Source Commit #2023.

Wei Chen

Technical expert, China Unicom Cloud Data Co., Ltd
I am a technical expert at China Unicom Cloud Data Co., Ltd, specializing in cloud computing infrastructure. I actively contribute to open-source projects, including KubeEdge, openEuler iSula, and Volcano.
Tuesday June 10, 2025 15:30 - 16:00 HKT
Level 19 | Crystal Court I
  AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese

16:15 HKT

Introducing AIBrix: Cost-Effective and Scalable Kubernetes Control Plane for VLLM - Jiaxin Shan & Liguang Xie, ByteDance
Tuesday June 10, 2025 16:15 - 16:45 HKT
Managing large-scale LLM inference workloads on Kubernetes requires more than just high-performance inference engines like vLLM. It demands a comprehensive control plane that integrates deeply with engines while addressing the complexities of large-scale operations. This need inspired the creation of AIBrix, a Kubernetes-native control plane designed to scale LLM inference with modularity, flexibility, and cutting-edge algorithms.

AIBrix introduces a pluggable architecture with components for LLM specific autoscaling, high-density lora management, distributed KV cache, heterogenous serving, model loading etc. AIBrix emphasizes deep co-design with inference engines, enabling advanced features and optimizations. This talk will demonstrate AIBrix in action, showcasing its ability to improve scalability and optimize resource utilization. Additionally, we will present detailed benchmarks to evaluate the performance of these components, providing actionable insights for practitioners.
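LLM-specific autoscaling typically keys on inference-side signals such as request-queue depth rather than CPU utilization. A toy sketch of that idea; the target-per-replica figure is an illustrative tunable, not an AIBrix default:

```python
import math

def desired_replicas(pending: int, target_per_replica: int,
                     lo: int = 1, hi: int = 16) -> int:
    """Scale replicas so each handles roughly target_per_replica pending requests,
    clamped to [lo, hi] to avoid thrashing and runaway scale-out."""
    want = math.ceil(pending / target_per_replica)
    return max(lo, min(hi, want))

print(desired_replicas(pending=37, target_per_replica=8))  # 5
print(desired_replicas(pending=0, target_per_replica=8))   # 1 (floor keeps one warm replica)
```

Queue depth tracks actual serving pressure: a replica that is GPU-bound on long generations shows modest CPU while its queue grows, which is exactly the case CPU-based HPA misses.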
Speakers

Jiaxin Shan

Software Engineer, Bytedance
Jiaxin works at ByteDance Infrastructure Lab, focusing on serverless and AI infrastructure. He is also a co-chair of Kubernetes WG-Serving, where he drives innovations and contributes to the future of scalable AI systems.

Liguang Xie

Director of Engineering, ByteDance
Liguang Xie is an Engineering Lead at ByteDance’s Compute Infrastructure Team, leading next-gen serverless infrastructure design and overseeing open-source, research, and engineering efforts. He has extensive experience in large-scale distributed systems, AI/ML platforms, and LLM/GNN...
Tuesday June 10, 2025 16:15 - 16:45 HKT
Level 19 | Crystal Court I
  AI + ML

17:00 HKT

Portrait Service: AI-Driven PB-Scale Data Mining for Cost Optimization and Stability Enhancement - Yuji Liu & Zhiheng Sun, Kuaishou
Tuesday June 10, 2025 17:00 - 17:30 HKT
Kuaishou's Kubernetes-based platform manages 200,000+ machines and 10M+ Pods, generating 10TB+ of data daily. An AI-driven portrait service enhances stability and performance:
● Stability Management: AI analyzes system and workload metrics to generate machine health scores, which are integrated into Kubernetes scheduling to evict or avoid unhealthy nodes. This reduced pod-creation delay incidents from 20 to 0.1 cases per day and boosted service availability from 90% to 99.99%.
● Performance Optimization:
Serving 10,000+ services with diverse resource sensitivities (compute-, cache-, and IO-intensive), we combine AI with microarchitecture data to pinpoint bottlenecks and create application profiles. Optimizing resource allocation (compute, cache, memory bandwidth) has increased average IPC by 20% and reduced LLC miss rates for cache-sensitive services from over 50% to 10%.
Future plans include integrating AI Agent technology to automate anomaly detection and reduce manual operations by 80%.
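The machine health scores described above can be pictured as a weighted roll-up of per-node metrics that the scheduler then filters on. The metric names, weights, and threshold below are invented for this sketch:

```python
def health_score(metrics: dict) -> float:
    """Roll normalized node metrics (each in [0, 1], higher = worse) into one score."""
    weights = {"disk_err_rate": 0.5, "mem_pressure": 0.3, "net_retrans": 0.2}
    penalty = sum(weights[k] * metrics.get(k, 0.0) for k in weights)
    return round(1.0 - penalty, 3)

def schedulable(metrics: dict, threshold: float = 0.8) -> bool:
    """A scheduler filter: nodes scoring below the threshold are avoided/evicted."""
    return health_score(metrics) >= threshold

healthy = {"disk_err_rate": 0.0, "mem_pressure": 0.1, "net_retrans": 0.05}
sick = {"disk_err_rate": 0.6, "mem_pressure": 0.8, "net_retrans": 0.2}
print(schedulable(healthy), schedulable(sick))  # True False
```

Collapsing many signals into one score is what makes the result consumable by scheduling: the scheduler needs a single filterable number per node, not 10TB of raw metrics.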
Speakers

Yuji Liu

Software Engineer, Kuaishou Technology
Container cloud engineer from Kuaishou.

Zhiheng Sun

Senior Software Engineer, Kuaishou
I am a cloud-native engineer at Kuaishou, specializing in application performance improvement on Kubernetes. I also led open-local, a cloud-native local storage project, in the open-source community.
Tuesday June 10, 2025 17:00 - 17:30 HKT
Level 19 | Crystal Court I
  AI + ML
 
Wednesday, June 11
 

11:00 HKT

Resilient Multiregion Global Control Planes With Crossplane and K8gb - Yury Tsarev & Steven Borrelli, Upbound
Wednesday June 11, 2025 11:00 - 11:30 HKT
Ensuring resilience in control planes is critical for organizations managing infrastructure and applications across multiple regions with Kubernetes. This talk presents a reference architecture for creating a Crossplane-based Global Control Plane, enhanced with k8gb for DNS-based failover and leveraging an Active/Passive setup.
We’ll explore how Crossplane’s declarative infrastructure provisioning integrates with k8gb to build robust, scalable, and resilient multicluster environments. Key takeaways include:

- Architecting resilient multiregion control planes with Active/Passive roles
- Demonstrating failover mechanisms where the Passive control plane transitions to Active during failures
- Strategies for optimizing failover times while maintaining availability

This session will guide attendees through proven methods and real-world challenges of building resilient Global Control Planes, empowering them to manage critical workloads across geographically distributed regions confidently.
Speakers

Steven Borrelli

Principal Solutions Architect, Upbound
Steven is a Principal Solutions Architect for Upbound, where he helps customers adopt Crossplane.

Yury Tsarev

Principal Solutions Architect, Upbound
Yury is an experienced software engineer who strongly focuses on open-source, software quality and distributed systems. As the creator of k8gb (https://www.k8gb.io) and active contributor to the Crossplane ecosystem, he frequently speaks at conferences covering topics such as Control...
Wednesday June 11, 2025 11:00 - 11:30 HKT
Level 19 | Crystal Court I
  Operations + Performance

11:45 HKT

Kube Intelligence - A Metric Based Insightful Remediation Recommender - Yash Bhatnagar, Google
Wednesday June 11, 2025 11:45 - 12:15 HKT
Not everything can be anticipated while designing or developing an application, so many design decisions are based on estimates and expected usage patterns.

More often than not, these estimates differ from reality and introduce inefficiencies across several fronts; if these become visible at all, it is usually much later in the lifecycle, when you already have many customers and a large footprint.

Hence, unless there is a clear sign of performance degradation or unjustified cost, there is often no incentive to invest the time and effort for uncertain gains.

In this session Yash will outline a real-world case study of how they built an internal platform to handle several post-deployment challenges, such as:

1. rightsizing opportunities,
2. architecture migrations, like moving to serverless,
3. finding the right maintenance windows, etc.,

by using a wide range of metrics, and how impactful these minor optimizations turned out to be.
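Metric-based rightsizing recommendations of the kind listed above often reduce to a percentile-plus-headroom rule over usage samples. The 95th percentile and 20% headroom here are illustrative choices, not the platform's actual policy:

```python
def recommend_request(samples_mcpu: list[int], pct: float = 0.95,
                      headroom: float = 1.2) -> int:
    """Recommend a CPU request (mCPU) from observed usage samples:
    a high percentile plus headroom, so rare spikes don't inflate the request."""
    s = sorted(samples_mcpu)
    idx = int(pct * (len(s) - 1))  # nearest-rank percentile, simplified
    return int(s[idx] * headroom)

# Ten usage samples with one spike; a max-based request would demand 500 mCPU.
usage = [100, 120, 110, 130, 500, 125, 115, 105, 140, 135]
print(recommend_request(usage))  # 168: p95 (140) plus 20% headroom
```

The gap between the naive max (500) and the percentile-based figure (168) is exactly the kind of "unknown gain" such a platform surfaces automatically across thousands of workloads.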
Speakers

Yash Bhatnagar

Software Engineer, Google
Yash is working with Google as Software Engineer, and has 9 years of industrial experience with cloud architectures and micro-service development across Google and VMware. He has been a speaker at several international conferences such as KubeCon + CloudNativeCon and Open Source...
Wednesday June 11, 2025 11:45 - 12:15 HKT
Level 19 | Crystal Court I
  Platform Engineering
  • Content Experience Level Any
  • Presentation Language English

13:45 HKT

Progressive Delivery Made Easy With Argo Rollouts - Kevin Dubois, Red Hat
Wednesday June 11, 2025 13:45 - 14:15 HKT
You might already be using a CI/CD solution, but are you 100% sure things will roll out without a glitch once you go to production? Unfortunately, differences between testing/staging and production environments are virtually unavoidable. There is always a risk of unforeseen issues related to your production environment and/or actual load, which can lead to disruptions for your users.

Progressive delivery is the next step after continuous delivery: rolling out your application in a controlled and automated way so you can verify and test it *in production* before it becomes fully available to your entire user base.

Embrace GitOps and Progressive Delivery with techniques like blue-green, canary release, shadowing traffic, dark launches and automatic metrics-based rollouts to validate the application in production using Kubernetes and tools like Istio, Prometheus, ArgoCD, and Argo Rollouts.
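The metrics-gated rollout these tools automate can be sketched as a loop that raises canary traffic step by step and aborts when the metric dips below target. The step weights and success-rate threshold are illustrative:

```python
def run_canary(steps: list[int], success_rate_at, target: float = 0.99):
    """Raise canary traffic through the given weights; abort on a bad metric.

    success_rate_at(weight) stands in for a Prometheus query evaluated at each
    step, as Argo Rollouts does with its analysis runs.
    """
    for weight in steps:
        if success_rate_at(weight) < target:
            return ("rolled_back", 0)  # automatic abort: traffic returns to stable
    return ("promoted", steps[-1])

good = lambda w: 0.999                            # healthy at every step
flaky = lambda w: 0.95 if w >= 50 else 0.999      # degrades under real load

print(run_canary([10, 25, 50, 100], good))   # ('promoted', 100)
print(run_canary([10, 25, 50, 100], flaky))  # ('rolled_back', 0)
```

The flaky case is the scenario the abstract warns about: a version that looks fine in staging but degrades under production load is caught at 50% traffic and rolled back automatically, without a human watching dashboards.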

Come to this session to learn about Progressive Delivery in action using Kubernetes.
Speakers

Kevin Dubois

Senior Principal Developer Advocate, Red Hat
Kevin is a Java Champion, software engineer, author and international speaker with a passion for Open Source, Java, and Cloud Native Development & Deployment practices. He currently works as developer advocate at Red Hat where he gets to enjoy working with Open Source projects and...
Wednesday June 11, 2025 13:45 - 14:15 HKT
Level 19 | Crystal Court I
  Platform Engineering

14:30 HKT

The Past, the Present, and the Future of Platform Engineering - Mauricio "Salaboy" Salatino, Diagrid & Viktor Farcic, Upbound
Wednesday June 11, 2025 14:30 - 15:00 HKT
Do you think platform engineering is too hard? Or is it just a buzzword? Is the CNCF landscape too tricky to visualize? If you’ve been in this industry long enough, you should know that platform engineering has been around for a long time.

Most of us have been trying to build developer platforms for decades, and most of us have failed at that. That begs the questions: “What is different now?” “Why will this time be different?” and “Do we have a chance to succeed?”

We’ll take a look at the past, the present, and the future of platform engineering. We’ll see what we were doing in the past, what we did wrong, and why we failed. Further on, we’ll see what we (the industry as a whole) are doing now and, more importantly, where we might go from here.

Get ready for the hard truths and challenges you will face when trying to build a platform based on Kubernetes. Join us for a pain-infused journey filled with challenges teams will face when building platforms to enable other teams.
Speakers

Viktor Farcic

Viktor Farcic, Upbound
Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.

Mauricio Salatino

Software Engineer, Diagrid
Mauricio works as an open source software engineer at Diagrid, contributing to and driving initiatives for the Dapr OSS project. Mauricio also serves as a Steering Committee member for the Knative project and co-leads the Knative Functions initiative. He published a book titled...
Wednesday June 11, 2025 14:30 - 15:00 HKT
Level 19 | Crystal Court I
  Platform Engineering

15:30 HKT

Composable Platforms: Modular Platform Engineering With Kratix and Backstage - Hossein Salahi, Liquid Reply
Wednesday June 11, 2025 15:30 - 16:00 HKT
Constructing and managing platforms for diverse teams and workloads presents a significant challenge in today's cloud-native environment. This session introduces the concept of composable platforms, using modular, reusable components as the foundation for platform engineering. This talk will demonstrate how Kratix, a workload-centric framework, and Backstage, an extensible developer portal, enable the creation of self-service platforms that balance standardization with adaptability.

The session will detail platform design for scalability and governance, streamlining developer workflows through Backstage, and using Kratix Promises for varied workload requirements. Attendees will gain practical insights into building scalable and maintainable platforms through real-world examples, architectural patterns, and a live demonstration of a fully integrated Kratix-Backstage deployment.
Speakers

Hossein Salahi

Tech Lead, Liquid Reply
Hossein is an experienced cloud computing professional with nearly a decade of expertise in distributed systems and cloud technologies. He began as a student specializing in cloud automation and progressed to a full-time role focusing on on-premises cloud infrastructure and containers...
Wednesday June 11, 2025 15:30 - 16:00 HKT
Level 19 | Crystal Court I
  Platform Engineering

16:15 HKT

Taming Dependency Chaos for LLM in K8s - Peter Pan, Neko Ayaka & Kebe Liu, DaoCloud
Wednesday June 11, 2025 16:15 - 16:45 HKT
For AI developers on Kubernetes, whether in a Jupyter notebook or serving an LLM, Python dependency management is always a headache:
- Prepare a set of base images? The maintenance volume and effort become a nightmare, since (1) packages in the AI world bump versions rapidly, and (2) different LLM codebases require different permutations/combinations of packages.
- Leave users to `pip install` by themselves? The resigned waiting blocks productivity and efficiency; you may agree if you have done it.
- On a GPU cloud, package preparation time can be costly: you rent a GPU but waste it waiting for pip downloads.
- You may choose to DIY: docker-commit your own base images, but then you have to worry about the Dockerfile, the registry, and additional cloud cost if you don't have a local Docker environment.

----
So we introduce https://github.com/BaizeAI/dataset.

The solution:
1. A CRD to describe the dependencies and environment.
2. A Kubernetes Job to pre-load the packages.
3. A PVC to store and mount them.
4. `conda` to switch between envs.
5. Sharing between namespaces.
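Sketching the CRD side of that flow, a Dataset-style custom resource might look like the dict below. Every field name here is a guess for illustration, not the project's actual schema (see github.com/BaizeAI/dataset for the real CRD):

```python
# Hypothetical custom resource describing a Python environment to pre-load.
# Group/version and all spec fields are placeholders for illustration.
dataset = {
    "apiVersion": "example.dev/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "llm-train-env", "namespace": "team-a"},
    "spec": {
        "condaEnv": "py311-training",
        "pipPackages": ["torch==2.3.0", "transformers==4.41.0", "vllm==0.4.2"],
        "storage": {"pvcName": "pkg-cache", "mountPath": "/envs"},
        "shareWithNamespaces": ["team-b"],
    },
}

# A controller would render this spec into a one-shot Job that installs the
# packages into the PVC, so every notebook/serving pod mounts the ready-made
# environment instead of re-running pip install on rented GPU time.
print(len(dataset["spec"]["pipPackages"]))  # 3
```

The design point is that the expensive step (downloading and resolving packages) runs once per declared environment, off the GPU node, and the result is a mountable, namespace-shareable artifact.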
Speakers

Peter Pan

R&D Engineering VP, DaoCloud
- DaoCloud Software Engineering VP
- Regular KubeCon Program Committee member: 2023 EU, 2024 HK, 2024 India, 2025 EU
- Regular KubeCon speaker: 2023 SH, 2024 EU, 2024 HK
- Maintainer of the CNCF projects cloudtty, kubean, and hwameistor
- CNCF WG-AI (AI Working Group) member and CNAI white-paper...

Kebe Liu

Senior Software Engineer, DaoCloud
AI Infra and Service Mesh Team Lead at DaoCloud. Member of Istio Steering Committee. Creator of open source projects such as Merbridge and kcover.

Neko Ayaka

Senior Software Engineer, DaoCloud
Cloud native developer, AI researcher, Gopher with 5 years of experience in loads of development fields across AI, data science, backend, frontend. Co-founder of https://github.com/nolebase
Wednesday June 11, 2025 16:15 - 16:45 HKT
Level 19 | Crystal Court I
  Application Development
  • Content Experience Level Any
  • Presentation Language English
 