10-11 June
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon China 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC+8:00). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.
Audience: Chinese
Tuesday, June 10
 

11:00 HKT

AI Model Distribution Challenges and Best Practices - Wenbo Qi & Xiaoya Xia, Ant Group; Eryu Guan, Aliyun; Wenpeng Li, Alibaba Cloud; Han Jiang, Kuaishou
Tuesday June 10, 2025 11:00 - 11:30 HKT
As the demand for scalable AI/ML grows, efficiently distributing AI models in cloud-native infrastructure has become a pivotal challenge for enterprises. The panel dives into the technical and operational strategies for deploying models at scale -- from optimizing model storage and transfer to ensuring consistency across clusters and regions. Experts from different companies and CNCF projects will debate critical questions like: How can Kubernetes-native workflows automate and accelerate model distribution while minimizing latency and bandwidth costs? How can huge models, hundreds of GBs or even TBs in size, be distributed efficiently? What challenges are posed by distributed inference and the prefilling-decoding architecture? How are models updated in the reinforcement learning post-training paradigm? What role do standards like OCI artifacts or specialized registries play in streamlining versioned model delivery?
Speakers
Han Jiang

Software Engineer, Kuaishou
Software engineer at Kuaishou who previously worked on the Kubernetes ecosystem and container-related technologies. He currently focuses on optimizing the inference performance of large language models.
Xiaoya

Open Source Analyst, Ant Group
Xiaoya Xia is a member of the Ant Group OSPO, where she focuses on catalyzing open source success through data-driven insights. Before joining Ant Group, Xiaoya was a PhD student at East China Normal University (ECNU), where she concentrated on research into open source ecosystem sustain...
Wenbo Qi

Software Engineer, Ant Group
Wenbo Qi is a software engineer at Ant Group working on Dragonfly. He is a maintainer of Dragonfly. He hopes to make positive contributions to open source software and believes that fear springs from ignorance.
Eryu Guan

Software Engineer, Aliyun
Wenpeng Li

Alibaba Cloud
Tuesday June 10, 2025 11:00 - 11:30 HKT
Level 19 | Crystal Court I
  AI + ML

11:00 HKT

An Alternative Metadata System for Large Kubernetes Clusters - Yingcai Xue & Yixiang Chen, ByteDance
Tuesday June 10, 2025 11:00 - 11:30 HKT
For an event-driven distributed system like Kubernetes, where components communicate by synchronizing incremental data through the KubeAPIServer, the metadata system is the most critical component. etcd is the only officially supported metadata system. Some projects, such as kine, have explored alternative metadata storage, but the alternatives are either not open source or have performance issues.
This talk covers ByteDance's work on high-performance Kubernetes metadata systems. It summarizes etcd's production issues, analyzes Kubernetes' metadata storage requirements, and introduces how we address them with KubeBrain.
Actual results from large-scale environments (over 20K nodes and 1M pods, running for years) show that KubeBrain enhances cluster performance and stability.
This talk helps attendees understand the challenges of metadata systems in large-scale clusters and provides insights into an open-source solution that has been proven in ByteDance's production environment.
Speakers
Yixiang Chen

Software Engineer, ByteDance
Yixiang is a seasoned cloud-native technologist with over 9 years of hands-on experience at ByteDance, where he has been at the forefront of large-scale Kubernetes ecosystem innovations. As a core contributor in cloud-native infrastructure, his expertise spans multiple domains including...
Yingcai Xue

Software Engineer, ByteDance
Graduated from Zhejiang University with a master's degree.
Tuesday June 10, 2025 11:00 - 11:30 HKT
Level 19 | Crystal Court II
  Operations + Performance

11:45 HKT

Building Ultra-Large-Scale Cloud Native Edge Systems Using Chaos Engineering - Yue Bao, Huawei Cloud Computing Technology & Yue Li, DaoCloud
Tuesday June 10, 2025 11:45 - 12:15 HKT
Fast-growing technologies, such as 5G networks, the industrial Internet, and AI, are giving edge computing an important role in driving digital transformation. Each new technology brings benefits, but it also brings challenges. First, there are massive numbers of heterogeneous edge devices, encompassing a broad range of device types. Second, edge devices are often located in unstable and complex physical and network environments, with limited bandwidth, high latency, and so on. How to overcome these challenges and build a stable, large-scale edge computing platform remains to be resolved.
KubeEdge is an open source edge computing framework that extends the power of Kubernetes from the central cloud to the edge. Kubernetes clusters powered by KubeEdge can now stably support 100,000 edge nodes and manage more than one million pods.
In this session, we will share the key challenges of managing massive numbers of heterogeneous edge nodes and show how ChaosMesh is used to make KubeEdge more reliable across large-scale edge nodes.
Speakers
Yue Bao

Senior Software Engineer, Huawei Cloud Computing Technology Co., Ltd.
Yue Bao is a software engineer at Huawei Cloud. She now works 100% on open source, focusing on lightweight edge for KubeEdge. She is a maintainer of KubeEdge and also the tech lead of KubeEdge SIG Release and Node. Before that, Yue worked on Huawei Cloud Intelligent...
Yue Li

Software Quality Engineer, DaoCloud
Works at DaoCloud as Quality Director, with more than 20 years of IT industry experience at China Mobile, Siemens, HP, EMC, and a startup company. A newcomer to cloud native and an open source fan who would like to adopt open source projects to improve enterprise software quality with fast releases.
Tuesday June 10, 2025 11:45 - 12:15 HKT
Level 19 | Crystal Court II
  Operations + Performance
  • Content Experience Level Any
  • Presentation Language Chinese

13:45 HKT

Fast and Furious: Practice in Horizon Robotics on Large-scale End-to-end Model Training - Chen Yangxue, Horizon Robotics & Zhihao Xu, Alibaba Cloud
Tuesday June 10, 2025 13:45 - 14:15 HKT
End-to-end large model training is crucial for advancing autonomous driving technology. Horizon Robotics leads in this field by leveraging deep learning algorithms and chip design. They efficiently train and deploy advanced perception models like Sparse4D using cloud-native technologies.
Training these models poses challenges, such as managing massive video data and numerous small files, ensuring high-performance training with over 2000 GPUs on RDMA, quickly identifying different failures, and diagnosing issues in large-scale training.
This session covers how Horizon Robotics manages large-scale training on Kubernetes. It highlights the role of distributed data caching, network topology awareness, and job affinity scheduling in optimizing a 2000 GPU training job. We'll also discuss strategies for restoring interrupted training jobs through backup machine replacement to enhance task resilience. Furthermore, experiences with CNCF projects like Volcano, Fluid, and NPD will be shared.
Speakers
Zhihao Xu

Software Engineer, Alibaba Cloud
Zhihao Xu is currently a software engineer at Alibaba Cloud focusing on infrastructure for AI model training and large-scale model inference. Also, he is now a Maintainer of the CNCF sandbox project Fluid, which is designed for data orchestration for data-intensive applications running...
Chen Yangxue

Software Engineer, Horizon Robotics
I'm Chen Yangxue, a software engineer at Horizon Robotics. With years of cloud-native experience, I'm building a ten-thousand-card training platform with a hybrid cloud setup. I've used tools like Kubernetes, Volcano, etc., to solve tough technical problems. I know how to optimize...
Tuesday June 10, 2025 13:45 - 14:15 HKT
Level 19 | Crystal Court I
  AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese

14:30 HKT

More Than Model Sharding: LWS & Distributed Inference - Peter Pan & Nicole Li, DaoCloud
Tuesday June 10, 2025 14:30 - 15:00 HKT
Large LLMs such as Llama3.1-405B or Deepseek-V3 (671B) require distributed inference across multiple nodes, for example with vLLM and a Ray backend.
However, this takes more than model slicing with tensor parallelism. Native K8s treats these cross-node workloads as unrelated, so several challenges arise:
- standalone StatefulSets without coordination
- the need for gang scheduling
- uncontrolled startup order among master and workers, causing boot lag
- HPA for the group as a whole instead of for each StatefulSet, so that the Ray head and workers scale together
- stable index and rank
- topology-aware grouping
- failure recovery for vLLM/PyTorch (not smart enough on their own), to avoid one pod or GPU failure disrupting overall inference

----
LWS, the LeaderWorkerSet (github.com/kubernetes-sigs/lws), is designed to address them:
- optimize resource coordination with a leader-worker set
- improve performance through co-location
- integrate scaling with HPA for the whole LWS together
- provide fault tolerance as a group through an all-or-nothing restart policy
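
The all-or-nothing restart idea from the list above can be illustrated with a minimal Python sketch. This is not LWS controller code or its API; the `Pod`, `Group`, and `reconcile` names are hypothetical and exist only to show the behavior, assuming that one unhealthy pod should cause every pod in its leader-worker group to be restarted together.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical stand-in for a pod; not a Kubernetes API type."""
    name: str
    healthy: bool = True
    restarts: int = 0


@dataclass
class Group:
    """One leader plus its workers, treated as a single unit."""
    leader: Pod
    workers: list[Pod] = field(default_factory=list)

    def pods(self) -> list[Pod]:
        return [self.leader, *self.workers]


def reconcile(group: Group) -> bool:
    """Restart every pod in the group if any single pod is unhealthy.

    Returns True when a group-wide restart was triggered.
    """
    if all(pod.healthy for pod in group.pods()):
        return False
    for pod in group.pods():
        pod.restarts += 1
        pod.healthy = True
    return True
```

Restarting the whole group trades some extra restarts for a consistent state: a distributed inference job with one missing rank is usually not able to serve traffic anyway.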
Speakers
Nicole Li

Cloud Native Developer, DaoCloud
Cloud Native Developer, Service Mesh & Istio Contributor, AI Newbie
Peter Pan

R&D Engineering VP, DaoCloud
- DaoCloud Software Engineering VP
- Regular KubeCon "Program Committee": 2023 EU, 2024 HK, 2024 India, 2025 EU
- Regular KubeCon Speaker: 2023 SH, 2024 EU, 2024 HK
- Maintainer of the following CNCF projects: cloudtty, kubean, hwameistor
- CNCF WG-AI (AI Working-Group) Member + CNAI white-paper...
Tuesday June 10, 2025 14:30 - 15:00 HKT
Level 19 | Crystal Court I
  AI + ML

14:30 HKT

Unlocking Kyverno: Mastering Policy Management in Large-Scale Kubernetes Clusters - Di Xu, Xiaohongshu & Xu Liu, RedNote
Tuesday June 10, 2025 14:30 - 15:00 HKT
With the growing adoption of Kubernetes, managing configurations and ensuring compliance across extensive clusters becomes increasingly complex. Kyverno, a native Kubernetes policy engine, offers a streamlined solution to these challenges. In this session, we'll explore how adopting Kyverno can enhance efficiency, simplify operations, centralize control, and reduce maintenance in Kubernetes environments. We'll demonstrate how Kyverno empowers organizations to effectively manage policies and tackle the unique challenges of large-scale Kubernetes deployments. Drawing from real-world experiences, we will share valuable lessons and best practices that facilitate seamless policy integration and management. Attendees will gain practical insights and tools to optimize their Kubernetes environments using Kyverno.
Speakers
Di Xu

CNCF Ambassador | Principal Software Engineer, Xiaohongshu
Currently, he works at Xiaohongshu leading a team focused on building a highly reliable and scalable container platform. He is the founder of CNCF Sandbox Project Clusternet. Also, he is a top 50 code contributor in the Kubernetes community. He has spoken many times at open source conferences...
Xu Liu

Senior Software Engineer, Xiaohongshu
Focused on the cloud native field, with extensive experience in managing large-scale Kubernetes clusters, container networking, and service mesh.
Tuesday June 10, 2025 14:30 - 15:00 HKT
Level 19 | Crystal Court II
  Operations + Performance
  • Content Experience Level Any
  • Presentation Language Chinese

15:30 HKT

⚡ Lightning Talk: Achieving Unstoppable Stability: Deploying OceanBase Across Multiple Kubernetes Clusters - Peng Wang, OceanBase
Tuesday June 10, 2025 15:30 - 15:35 HKT
Distributed databases like OceanBase offer scalability and fault tolerance but can be challenging to manage in Kubernetes. Kubernetes is widely used for managing workloads, but deploying OceanBase on a single cluster creates a risk of failure. If the cluster fails, the entire database may become unavailable, which is problematic in production environments.

This talk will explore how deploying OceanBase across multiple Kubernetes clusters can solve this problem. Distributing the database across clusters ensures high availability and reduces the impact of a cluster failure. It also makes Kubernetes upgrades safer for operations teams.

We’ll cover the challenges of managing distributed databases in Kubernetes, like data consistency and load balancing. We’ll also show how multi-cluster deployments improve stability and resilience, making the solution stronger for critical applications. Attendees will learn how this architecture boosts fault tolerance and simplifies database management.
Speakers
Peng Wang

Global Technical Evangelist, OceanBase
Peng Wang is the Global Technical Evangelist for OceanBase, a distributed relational database designed for cloud-native applications. He has over a decade of experience in the database industry, including his previous role as a team lead in Intel’s database R&D group. He is currently...
Tuesday June 10, 2025 15:30 - 15:35 HKT
Level 16 | Grand Ballroom I
  ⚡ Lightning Talks, Data Processing + Storage
  • Content Experience Level Any
  • Presentation Language Chinese

15:30 HKT

Smart GPU Management: Dynamic Pooling, Sharing, and Scheduling for AI Workloads in Kubernetes - Wei Chen, China Unicom Cloud Data & Mengxuan Li, Dynamia
Tuesday June 10, 2025 15:30 - 16:00 HKT
With the rapid growth of AI applications, optimal GPU utilization is essential, particularly in GPU sharing and job scheduling. Balancing performance, flexibility, and isolation is as challenging as the “Impossible Trinity”. Technologies such as vCUDA, MPS, and MIG are promising attempts, but each has its pros and cons. Managing clusters with multiple sharing techniques adds complexity due to differing resource names and configurations.
In this talk, we will demonstrate how to combine these methods easily. Users specify the memory and core count without managing GPU types or sharing methods. Based on user preferences and GPU resources, the best node and method will be selected. Requests are automatically translated into optimal profiles, and GPUs are dynamically partitioned.
This approach streamlines GPU management, enhances utilization, and improves scheduling. By integrating Volcano and HAMi, the solution strengthens GPU pooling and scheduling, optimizing AI workload management.
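
The request-to-profile translation described in this abstract can be sketched in a few lines of Python. This is only an illustration, not the Volcano or HAMi algorithm: the `pick_profile` function and its selection rule (smallest profile that fits) are assumptions, and the table uses the MIG profile names of an NVIDIA A100 40GB as example data.

```python
from typing import Optional

# (profile name, compute slices, memory in GB).
# Example data: MIG profiles of an NVIDIA A100 40GB.
PROFILES = [
    ("1g.5gb", 1, 5),
    ("2g.10gb", 2, 10),
    ("3g.20gb", 3, 20),
    ("4g.20gb", 4, 20),
    ("7g.40gb", 7, 40),
]


def pick_profile(mem_gb: int, slices: int) -> Optional[str]:
    """Return the smallest profile that satisfies the request.

    Hypothetical selection rule: among profiles with enough memory and
    compute slices, prefer the one with the least memory, then the
    fewest slices. Returns None when nothing fits.
    """
    fits = [p for p in PROFILES if p[1] >= slices and p[2] >= mem_gb]
    if not fits:
        return None
    return min(fits, key=lambda p: (p[2], p[1]))[0]
```

With a rule like this, a user asks only for memory and compute, and the profile name stays an internal detail of the scheduler.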
Speakers
Mengxuan Li

Software Engineer, Dynamia Inc
Member of the Volcano community, responsible for developing the GPU virtualization mechanism in Volcano. It has been merged into the master branch of Volcano and will be released in v1.8. Speaker at OpenAtom Global Open Source Commit#2023.
Wei Chen

Technical expert, China Unicom Cloud Data Co., Ltd
I am a technical expert at China Unicom Cloud Data Co., Ltd, specializing in cloud computing infrastructure. I actively contribute to open-source projects, including KubeEdge, openEuler iSula, and Volcano.
Tuesday June 10, 2025 15:30 - 16:00 HKT
Level 19 | Crystal Court I
  AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese

15:30 HKT

Revolutionizing Sidecarless Service Mesh With eBPF - Zhonghu Xu & Muyang Tian, Huawei
Tuesday June 10, 2025 15:30 - 16:00 HKT
It is widely recognized that service mesh sidecars introduce significant resource overhead, adversely affecting memory and CPU utilization. Furthermore, the tight coupling of sidecars with workloads complicates lifecycle management.

In this session, we will compare the pros and cons of the mainstream implementations: Istio, Ambient and Cilium. All of them, however, use a userspace proxy per node, which introduces a single point of failure and increases the number of connections per hop. We aim to demonstrate how eBPF and programmable kernel modules can significantly mitigate these issues.

Lastly, we will introduce several use cases that adopt this approach to improve microservice performance while minimizing interruption to applications during infrastructure upgrades.
Speakers
Muyang Tian

Operating System Engineer, Huawei
Operating system engineer of Huawei Technologies Co., Ltd., core member of Kmesh, contributor of libxdp. Enthusiastic about cloud native technology and eBPF-based high performance network.
Zhonghu Xu

Principal Software Engineer, Huawei
Zhonghu is an Istio Steering Committee member, has been a core maintainer of Istio since 2018, and is one of the top 3 Istio contributors. He is also the CNCF TAG-Network Tech Lead. He is a maintainer of many CNCF projects, including Istio, Kmesh, and Volcano. He is also among the top 100 Kubernetes contributors...
Tuesday June 10, 2025 15:30 - 16:00 HKT
Level 19 | Crystal Court II
  Connectivity
  • Content Experience Level Any
  • Presentation Language Chinese

15:51 HKT

⚡ Lightning Talk: Best Practices for Upgrading Service Mesh Seamlessly - Hang Yin, Alibaba Cloud & Zhencheng Lee, Huawei Technologies
Tuesday June 10, 2025 15:51 - 15:56 HKT
Service Mesh is thriving, with new versions regularly incorporating exciting features and significant CVE fixes that bring considerable benefits to users. However, the disruption of service traffic caused by Service Mesh upgrades or restarts, which leads to system instability, remains a major obstacle to using Service Mesh in production. In the most mature sidecar model, upgrading the data plane of the service mesh results in the redeployment of services; in some cases, this is nearly unacceptable, as certain business applications may face substantial cold start costs. Even in the emerging sidecarless mode, the interruption of existing user connections still has to be addressed, which requires difficult choices. This talk will begin with real-world case studies, in which technical experts from Huawei Cloud and Alibaba Cloud share practical experience with seamless service mesh upgrades in real production scenarios.
Speakers
Hang Yin

Senior R&D Engineer, Alibaba Cloud
Hang Yin is a senior engineer at Alibaba Cloud, focusing on Kubernetes, service mesh, and other cloud native fields. He currently serves on the Alibaba Cloud Service Mesh (ASM) team, responsible for core capabilities of ASM such as performance improvement, ecosystem, and Mesh Topology.
Zhencheng Lee

Huawei Cloud Senior R&D Engineer, Huawei Technologies Co., Ltd.
Senior Engineer at Huawei Cloud, specializing in Kubernetes, service mesh, and other cloud-native technologies. I am the primary developer and maintainer of the CNCF project Kmesh and actively contribute to several other CNCF projects, with a particular emphasis on service mesh and...
Tuesday June 10, 2025 15:51 - 15:56 HKT
Level 16 | Grand Ballroom I
  ⚡ Lightning Talks, Connectivity

15:58 HKT

⚡ Lightning Talk: Deep Dive Into Kernel Requirements: Strengthening Cloud Native With New Kernel Features - Qifeng Guo, DaoCloud
Tuesday June 10, 2025 15:58 - 16:03 HKT
- Kubernetes 1.31: Moving cgroup v1 Support into Maintenance Mode: making cgroup v2 (kernel 5.8+) a key requirement.
- "Linux Kernel Version Requirements" shows the kernel requirements of Kubernetes features
- eBPF and modern networking and observability

This talk will provide a detailed look at the kernel version requirements for Kubernetes, with a focus on evolving trends in AI infrastructure, SIG-Node, and SIG-Network. We will explore how different kernel versions influence Kubernetes cluster operations, especially in the areas of network performance, resource management, and security enhancements. This session will also highlight some of the rising star projects in the cloud-native ecosystem, including Cilium, Falco, Pyroscope, Kepler and DeepFlow.


Key Topics:
- AI Infrastructure (device related)
- Kubernetes SIG-Node (cgroup)
- Kubernetes SIG-Network (nftables)
- eBPF-based project requirements
- Is checking the kernel version enough?
- Dependencies/Ecosystem Maintenance
Speakers
Qifeng Guo

Software Engineer, DaoCloud
I'm a software developer from DaoCloud, China, and a Kubernetes contributor. Outside work, I'm often active in Kubernetes Networking, including Kube-Proxy, Calico, Cilium, Metallb, and more.
Tuesday June 10, 2025 15:58 - 16:03 HKT
Level 16 | Grand Ballroom I
  ⚡ Lightning Talks, Cloud Native Experience
  • Content Experience Level Any
  • Presentation Language Chinese

16:26 HKT

⚡ Lightning Talk: Kubernetes Isekai (異世界):Transforming Kubernetes Education Into a Gamified Adventure - Cyrus Wong & Hongyi Qian, Hong Kong Institute of Information Technology
Tuesday June 10, 2025 16:26 - 16:31 HKT
Kubernetes Isekai (異世界) is an open-source RPG designed for hands-on Kubernetes learning through gamification. Ideal for junior to Higher Diploma students at Hong Kong Institute of Information Technology (HKIIT), it transforms Kubernetes education into an engaging adventure.

Role-Playing Adventure: Students interact with NPCs who assign Kubernetes tasks.
Task-Based Learning: Tasks involve setting up and managing Kubernetes clusters.
Free Access: Uses AWS Academy Learner Lab with Minikube or Kubernetes.
Scalable Grading: AWS SAM application tests Kubernetes setups within AWS Lambda.
Progress Tracking: Students track progress and earn rewards.
This game offers practical Kubernetes experience in a fun, cost-effective way.
GenAI Chat: Integrates Generative AI to make NPC interactions more dynamic and fun, enhancing the overall learning experience.
Demo
https://www.youtube.com/watch?v=dIwNWwz681k
Speakers
Cyrus Wong

Senior Lecturer, Hong Kong Institute of Information Technology
Cyrus Wong is an accomplished senior lecturer who oversees the Higher Diploma program in Cloud and Data Centre Administration at the Hong Kong Institute of Information Technology (HKIIT) in Hong Kong. He is a passionate advocate for the adoption of cloud technology across various...
Hongyi Qian

Cloud major student, Hong Kong Institute of Information Technology at IVE (Lee Wai Lee)
I am pursuing a Higher Diploma in Cloud and Data Centre Administration at the Hong Kong Institute of Information Technology at IVE (Lee Wai Lee) and am currently interning at Cathay Pacific Airways. This project teaches Kubernetes concepts and commands in a gamified way. By turning...
Tuesday June 10, 2025 16:26 - 16:31 HKT
Level 16 | Grand Ballroom I
  ⚡ Lightning Talks, Cloud Native Novice

16:40 HKT

⚡ Lightning Talk: Stateful Service Federation in Large-Scale Search, Ads, and Recommendation Scenarios at Xiaohongshu - Yang Song & Vec Sun, Xiaohongshu
Tuesday June 10, 2025 16:40 - 16:45 HKT
Search, advertising, and recommendation services are among the primary business types within Xiaohongshu. Because these services depend heavily on index tables, each instance replica needs to maintain its own independent state. As a result, such services are deployed as stateful workloads.
With the rapid growth of Xiaohongshu's business scale, the size limit of a single Kubernetes cluster has made it impossible to further scale stateful services. To address daily traffic and business growth, the only solution was to migrate workloads to idle clusters. However, this migration approach has caused significant inconvenience and risks for developers.
To tackle this challenge, Xiaohongshu leveraged Karmada to implement the federation of stateful services. By designing scheduling and deployment capabilities for stateful services on federated clusters, this approach has seamlessly resolved the scaling limitations that single-cluster capacity constraints imposed on stateful services.
Speakers
Vec Sun

CloudNative Developer, Xiaohongshu
Sunweixiang previously worked in the Alibaba Cloud container team as a software engineer and is a contributor to OpenKruise, Karmada, and other communities. He is deeply involved in container application orchestration and multi-cluster management.
Yang Song

Software Engineer, Xiaohongshu
Song Yang is a Cloud Native Development Engineer at Xiaohongshu, currently working on multi-cluster and Kubernetes scheduler. He is a maintainer of the CNCF incubating project KubeVela.
Tuesday June 10, 2025 16:40 - 16:45 HKT
Level 16 | Grand Ballroom I
  ⚡ Lightning Talks, Application Development

16:47 HKT

⚡ Lightning Talk: Mastering Prefill-Decode-Disaggregated Architecture: Solutions and Best Practices in Alibaba Cloud - Jing Gu & Yang Che, Alibaba Cloud
Tuesday June 10, 2025 16:47 - 17:52 HKT
Disaggregating the prefill and decoding phases in LLM inference has garnered significant attention in the industry because it can enhance performance. Several solutions have been developed, including Mooncake, TetriInfer, Splitwise, DistServe, and RTP-LLM. However, deploying disaggregated LLM inference at scale on Kubernetes while evaluating its performance and cost benefits presents numerous challenges.
In this talk, we will introduce a solution that uses a LeaderWorkerSet as the workload, an Ingress Controller and a node discovery service. It can deploy disaggregated PD on Kubernetes, supporting multiple LLM inference engines like Mooncake and RTP-LLM with zero intrusion. Furthermore, we will discuss improving load balancing using Envoy and ORCA, based on KVCache and metrics, and recommending optimal ratios for the PD phases. Finally, we will cover essential features for production deployment such as high availability, elastic scaling, canary releases, and observability.
Speakers
Yang Che

Senior Software Engineer, Alibaba Cloud
Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor...
Jing Gu

Software Engineer, Alibaba Cloud
Jing Gu is a senior engineer at Alibaba Cloud. She works on Alibaba Cloud Container Service for Kubernetes, focusing on serving large language models (LLMs) within Kubernetes and optimizing LLM inference processes.
Tuesday June 10, 2025 16:47 - 17:52 HKT
Level 16 | Grand Ballroom I
  ⚡ Lightning Talks, AI + ML

17:00 HKT

Portrait Service: AI-Driven PB-Scale Data Mining for Cost Optimization and Stability Enhancement - Yuji Liu & Zhiheng Sun, Kuaishou
Tuesday June 10, 2025 17:00 - 17:30 HKT
Kuaishou's Kubernetes-based platform manages 200,000+ machines and 10M+ Pods, generating 10TB+ of data daily. An AI-driven intelligent portrait service enhances stability and performance:
● Stability Management: AI analyzes system and workload metrics to generate machine health scores, integrated into Kubernetes scheduling to evict/avoid unhealthy nodes. This reduced pod creation delays from 20 to 0.1 cases/day and boosted service availability from 90% to 99.99%.
● Performance Optimization: Serving 10,000+ services with diverse resource sensitivities (compute-, cache-, and IO-intensive), we combine AI with microarchitecture data to pinpoint bottlenecks and create application profiles. Optimizing resource allocation (compute, cache, memory bandwidth) has increased average IPC by 20% and reduced LLC miss rates for cache-sensitive services from over 50% to 10%.
Future plans include integrating AI Agent technology to automate anomaly detection and reduce manual operations by 80%.
Speakers
Yuji Liu

Software Engineer, Kuaishou Technology
Container cloud engineer from Kuaishou.
Zhiheng Sun

Senior Software Engineer, Kuaishou
I am a cloud-native engineer at Kuaishou, specializing in application performance improvement on Kubernetes. I have also led open-local, a cloud-native local storage project in the open-source community.
Tuesday June 10, 2025 17:00 - 17:30 HKT
Level 19 | Crystal Court I
  AI + ML

17:00 HKT

Unlocking the Power of CEL for Advanced Multi-Cluster Scheduling - Qing Hao & Jian Qiu, Red Hat
Tuesday June 10, 2025 17:00 - 17:30 HKT
The Common Expression Language (CEL) is a powerful solution already used in the Kubernetes API, with the recent Kubernetes v1.32 highlighting it for mutating admission policies. It is also used in Envoy and Istio. This topic will explore the benefits and features that CEL can offer for multi-cluster scheduling.

There is a growing demand for granular and customizable requirements in scheduling. For example, users may want to filter clusters with the label "version" > v1.30.0 instead of listing all versions. Many also wish to use their CRD fields or metrics for scheduling. CEL's extensibility effectively addresses these challenges as it can handle complex expressions.

In this talk, we will showcase how Open Cluster Management (OCM) leverages CEL in multi-cluster scheduling. Using the ClusterProfile API as an example, we will demonstrate how CEL meets complex scheduling needs and illustrate its potential to improve GPU utilization for AI applications by solving bin-packing challenges.
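
The version filter mentioned in the abstract above (clusters whose "version" label is greater than v1.30.0) shows why a plain string match is not enough. The sketch below is ordinary Python, not CEL syntax and not the OCM API; `parse_version` and `newer_than` are hypothetical helpers, and the cluster names and labels are made-up sample data.

```python
def parse_version(label: str) -> tuple[int, ...]:
    """Turn a label such as 'v1.30.10' into (1, 30, 10) for numeric comparison."""
    return tuple(int(part) for part in label.lstrip("v").split("."))


def newer_than(clusters: dict[str, dict[str, str]], minimum: str) -> list[str]:
    """Return the names of clusters whose 'version' label is strictly above minimum.

    Clusters without a 'version' label are skipped.
    """
    floor = parse_version(minimum)
    return [
        name
        for name, labels in clusters.items()
        if "version" in labels and parse_version(labels["version"]) > floor
    ]


# Made-up sample data for illustration only.
clusters = {
    "cluster-a": {"version": "v1.29.4"},
    "cluster-b": {"version": "v1.30.0"},
    "cluster-c": {"version": "v1.31.2"},
    "cluster-d": {"version": "v1.30.10"},
    "cluster-e": {},
}
selected = newer_than(clusters, "v1.30.0")
```

Comparing the labels as strings would rank "v1.30.10" below "v1.30.2", which is the kind of case an expression-based filter has to handle explicitly.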
Speakers
Jian Qiu

Senior Principal Software Engineer, Red Hat
Qiu Jian is a developer at Red Hat, mainly focusing on multi-cluster management.
Qing Hao

Senior Software Engineer, Red Hat
Qing Hao is a Senior Software Engineer at Red Hat, where she works as a maintainer of Open Cluster Management. She is also a CNCF Ambassador, a speaker at KubeCon China 2024, and a mentor for OSPP 2022 and GSoC 2024. Qing focuses on solving complex challenges...
Tuesday June 10, 2025 17:00 - 17:30 HKT
Level 19 | Crystal Court II
  Emerging + Advanced
  • Content Experience Level Any
  • Presentation Language Chinese
 
Wednesday, June 11
 

09:12 HKT

Keynote: Optimizing AI Workload Scheduling: Bilibili's Journey To an Efficient Cloud Native AI Platform - Long Xu, Bilibili & Kevin Wang, Huawei
Wednesday June 11, 2025 09:12 - 09:22 HKT
As China's leading video platform, Bilibili faces 4 key challenges in multi-cluster AI workload management:
1. Workload Diversity: Training/inference/video processing workloads have different scheduling requirements.
2. Cross-Cluster Complexity: Managing workloads across multiple Kubernetes clusters in expanding IDCs with SLAs.
3. Performance Demands: Minimal startup latency and best scheduling efficiency for short-running tasks e.g. video processing.
4. Efficiency-QoS Balance: maximizing resource utilization while ensuring priority workload stability.

This talk will share experiences and delve into specific optimization techniques:
1. Leveraging and optimizing CNCF projects such as Karmada and Volcano to build a unified, high-performance AI workload scheduling platform.
2. Integrating technologies such as KubeRay to schedule various AI online and offline workloads.
3. Maximizing resource efficiency through online and offline hybrid scheduling, tidal scheduling and other technologies.
Speakers
Kevin Wang

Technical Expert, Lead of Cloud Native Open Source, Huawei
Kevin Wang has been an outstanding contributor in the CNCF community since its beginning and is the leader of the cloud native open source team at Huawei. Kevin has contributed critical enhancements to Kubernetes, led the incubation of the KubeEdge, Volcano, Karmada projects in CNCF... Read More →

Long Xu

Senior Software Engineer, Bilibili
Long Xu is a Senior Software Engineer in the Infrastructure Department at Bilibili. He has extensive experience with Kubernetes, including scheduling, autoscaling, and system stability.
Level 16 | Grand Ballroom I
  Keynote Sessions, AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese

09:36 HKT

Keynote: Who Owns Your Pod? Observing and Blocking Unwanted Behavior at eBay With eBPF - Jianlin Lv, eBay & Liyi Huang, Isovalent at Cisco
Wednesday June 11, 2025 09:36 - 09:46 HKT
Kubernetes admins often struggle to understand pod activities, both for regular pods and those with various privileges. This session explores two use cases that highlight why Tetragon, an eBPF-based observability and enforcement tool, is valuable for pod security:
1. Replacing Auditbeat with Tetragon: Learn how Auditbeat rules were mapped to Tetragon tracing policies, which functionality gaps were identified, and how eBay contributed back to the community.
2. Auditing Container Process Permissions: See how Tetragon helped analyze pod behavior and determine whether applications could migrate to more restrictive pod security policies, ensuring adherence to the principle of least privilege.
We also cover deployment challenges, such as integrating with SIEM platforms, resource utilization, and implementing runtime enforcement for unwanted pod behavior. This talk provides practical insights into using Tetragon for observability, policy refinement, and improving overall pod security posture in Kubernetes environments.
Speakers

Jianlin Lv

Senior Linux Kernel Development Engineer, eBay
https://www.linkedin.com/in/jianlin-lv-25650141/

Liyi Huang

Customer Success Architect, Isovalent at Cisco
Senior solution architect @isovalent.com
Level 16 | Grand Ballroom I
  Keynote Sessions, Observability

09:48 HKT

Keynote: How We Save $900 per Day with Self-Hosted AI: Building Scalable Local LLM Infrastructure - Vivian Hu, Product Manager, Second State & Lv Yi, CTO, 5miles
Wednesday June 11, 2025 09:48 - 09:58 HKT
While SaaS AI providers like OpenAI offer convenient LLM services, they come with significant drawbacks: high costs, lack of customization, lack of privacy, and usage limitations that can throttle high-volume applications.

This presentation shows how a leading e-commerce website deployed a highly customized suite of LLM applications on private cloud infrastructure, reducing costs by 90% while maintaining complete control over scalability and quality of service. We'll discuss the technology stack for orchestrating inference workloads on cloud GPUs, and explore practical strategies for building stable, scalable, high-performance AI apps on your own private cloud infrastructure.
Speakers

Lv Yi

CTO, 5miles
Lv Yi is the CTO of 5miles, a leading e-commerce platform in the United States. With 19 years in IT, he is a cloud native enthusiast who previously served as a mobile business expert at AsiaInfo. In 2012, he led Zhangyue's systems evolution toward microservices architecture. At 5miles... Read More →

Vivian Hu

Product Manager, Second State
Vivian Hu is a Product Manager at Second State and a columnist at InfoQ. She is a founding member of the WasmEdge project. She organizes Rust and WebAssembly community events in Asia.
Level 16 | Grand Ballroom I
  Keynote Sessions

10:00 HKT

Keynote: Building a Large Model Inference Platform for Heterogeneous Chinese Chips Based on vLLM - Haiwen Zhang, China Mobile & Kante Yin, DaoCloud
Wednesday June 11, 2025 10:00 - 10:10 HKT
With the growing demand for heterogeneous computing power, Chinese users are gradually adopting domestic GPUs, especially for inference. vLLM, the most popular open-source inference project, has drawn widespread attention but does not support domestic chips. Chinese inference engines are still developing in functionality, performance, and ecosystem. In this session, we’ll introduce how to adapt vLLM to support domestic GPUs, enabling acceleration features like PagedAttention, Continuous Batching, and Chunked Prefill. We’ll also cover performance bottleneck analysis and chip operator development to maximize hardware potential.
Additionally, Kubernetes has become the standard for container orchestration and is the preferred platform for inference services. We’ll show how to deploy the adapted vLLM engine on Kubernetes using the open-source llmaz project with a few lines of code, and explore how llmaz handles heterogeneous GPU scheduling and our practices for monitoring and elastic scaling.
Speakers

Haiwen Zhang

Senior Software Engineer, China Mobile (Suzhou) Software Technology Co., Ltd.
Haiwen Zhang has extensive experience in cloud-native and AI inference development and currently works at China Mobile, focusing on the research and development of cloud-native and AI inference products. He has shared his service mesh experience at technical conferences such as the... Read More →

Kante Yin

Software Engineer, DaoCloud
Kante is a senior software engineer and an open source enthusiast from DaoCloud. His work mostly centers on scheduling, resource management, and LLM inference. He actively contributes to upstream Kubernetes as a SIG-Scheduling maintainer and helps incubate several projects like Kueue... Read More →
Level 16 | Grand Ballroom I
  Keynote Sessions, AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese

11:45 HKT

China Mobile's Panji Platform: Observability Practices and Implementations for LLM Applications Base - Jing Shang, China Mobile & Casey Li, Yunshan Networks, Inc.
Wednesday June 11, 2025 11:45 - 12:15 HKT
As large language model (LLM) applications are widely deployed, their complex architectures challenge business observability. APM probes, which rely on instrumentation or proxy operation, consume system resources and impact traffic and performance, restricting their use in complex scenarios. In addition, multiple teams handling different LLM instances make it hard to build unified observability in a coordinated way.
To solve this, China Mobile's Panji platform collaborates with DeepFlow to achieve zero-intrusion (Zero Code) and full-stack (Full Stack) observability instantly, using eBPF and Wasm technologies. eBPF collects real-time data at the kernel level, while Wasm plugins parse streaming requests. By integrating existing data, the platform provides a universal service map, distributed tracing, and multi-dimensional metric analysis, ensuring the stability and performance optimization of LLM applications.
Speakers

Jing Shang

Chief Expert of China Mobile Group, China Mobile
Dr. Shang Jing, Chief Expert at China Mobile Group, has over 20 years of experience in IT system development, construction, and operation. Specializing in big data and cloud technologies, she led the development of China Mobile's Wutong Big Data Platform. Under her leadership, the... Read More →

Casey Li

Product Manager, Yunshan Networks, Inc.
Starting from graduate school at Huazhong University of Science and Technology in 2013, I joined Tencent Cloud virtual network team in 2016, which provided me with in-depth theoretical knowledge and practical experience in cloud networks. In 2018, I joined YUNSHAN Networks as PM... Read More →
Level 16 | Grand Ballroom I
  Observability

13:45 HKT

Solidigm CSAL Solution Brings Advanced IO Shaping, Caching and Data Placement Into NVIDIA DPU DOCA S - Wayne Gao, Solidigm & Long Chen, NVIDIA
Wednesday June 11, 2025 13:45 - 14:15 HKT
CSAL is the Cloud Storage Acceleration Layer for big data and AI. It is an open-source user-mode FTL, cache, and I/O trace component upstreamed in SPDK, and it is used commercially in Alibaba's cloud storage system.
See https://www.solidigm.com/products/technology/cloud-storage-acceleration-layer-write-shaping-csal.html. Alibaba and Solidigm also published a joint paper at EuroSys 2024: https://dl.acm.org/doi/pdf/10.1145/3627703.3629566
Session Topics:
This session presents joint development work with the NVIDIA DPU team and BeeGFS.
1. CSAL uses DPU DRAM as its write buffer, achieving its best storage latency so far while still guaranteeing data consistency.
2. QLC high-density storage is favored by the AI industry because it saves power and space in AI data centers. A DPU storage solution offers the same benefits, so combining the two is a natural fit.
3. CSAL brings advanced storage I/O shaping, caching, and data placement software into the NVIDIA DPU DOCA storage software service.
4. Experimental data and results from the DPU, CSAL, and BeeGFS setup.
Speakers

Long Chen

Director, NVIDIA
Responsible for promoting NVIDIA networking for high-speed storage and new application markets in China.

Wayne Gao

Principal Storage Solution Architect, Solidigm
Wayne Gao is a Principal Engineer and storage solution architect who worked on CSAL from PF through its commercial release at Alibaba. Wayne was also the main developer completing the CSAL pmem/DSA and cxl.mem PF work, from Intel to Solidigm. Before joining Intel, Wayne had over 20 years of storage... Read More →
Level 19 | Crystal Court II
  Data Processing + Storage

13:45 HKT

Connecting Dots: Unified Hybrid Multi-Cluster Auth Experience With SPIFFE and Cluster Inventory API - Chen Yu, Microsoft & Jian Zhu, Red Hat
Wednesday June 11, 2025 13:45 - 14:15 HKT
As the multi-cluster pattern continues to evolve, managing K8s identities, credentials, and permissions for teams and multi-cluster apps, such as Argo and Kueue, has become a hassle, typically involving managing individual service accounts on each cluster and passing credentials around. Such a setup is often scattered, repetitive, and difficult to track and audit, and may introduce security and operational complications. This is especially true in hybrid environments, where different solutions could be in play across platforms.

This demo presents a solution based on OpenID, SPIFFE/SPIRE, and Cluster Inventory API from the Multi-Cluster SIG that provides a unified, seamless, and secure auth experience. Facilitated by CNCF multi-cluster projects, OCM and KubeFleet, attendees could be inspired to leverage open source solutions to eliminate credential sprawl, reduce operational complexity, and enhance security in hybrid cloud environments, when setting up teams/applications to access a multi-cluster setup.
Speakers

Chen Yu

Senior Software Engineer, Microsoft
Chen Yu is a senior software engineer at Microsoft with a keen interest in cloud-native computing. He is currently working on Multi-Cluster Kubernetes and contributing to the Fleet project open-sourced by Azure Kubernetes Service.

Jian Zhu

Senior Software Engineer, Red Hat
Zhu Jian is a senior software engineer at Red Hat, a speaker at KubeCon China 2024, and a core contributor to the Open Cluster Management project. Jian enjoys solving multi-cluster workload distribution problems and extending OCM with add-ons.
Level 16 | Grand Ballroom I
  Security

14:30 HKT

Exploring KubeEdge Graduation: Build a Diverse and Collaborative Open Source Community From Scratch - Yue Bao & Fei Xu, Huawei; Hongbing Zhang, DaoCloud; Huan Wei, Hangzhou HarmonyCloud; Benjamin Huo, QingCloud
Wednesday June 11, 2025 14:30 - 15:00 HKT
Recently, the health of open-source projects, particularly vendor diversity and neutrality, has become a key topic of discussion. Many projects have faced challenges due to a lack of vendor diversity, threatening their sustainability. It is increasingly clear that setting up the right governance structure and project team during a project’s growth is critical.
KubeEdge, the industry's first cloud-native open-source edge computing project, has grown from its initial launch in 2018 to achieving CNCF graduation this year. Over the past few years, KubeEdge has evolved from a small project into a diverse, collaborative, and multi-vendor open-source community.
In this panel, we will discuss the lessons learned from the KubeEdge community's graduation journey, focusing on key strategies in technical planning, community governance, developer growth, and project maintenance. Join us to explore how to build a multi-vendor and diverse community, and how to expand into different industries.
Speakers

Huan Wei

Senior Technical Director, Hangzhou HarmonyCloud Technologies Co., Ltd
Huan is an open source enthusiast and cloud native technology advocate. He is currently a CNCF Ambassador and a TSC member of the KubeEdge project, and serves as a technical director at HarmonyCloud.

Fei Xu

Senior Software Engineer, Huawei
KubeEdge TSC Member, Senior Software Engineer at Huawei Cloud. Focuses on cloud native, Kubernetes, service mesh, edge computing, edge AI, and other fields. Currently maintains the KubeEdge project, a CNCF graduated project, and has rich experience in cloud native and edge computing... Read More →

Benjamin Huo

KubeSphere founding member, KubeEdge TSC member, Director of Cloud Platform, QingCloud Technologies
Benjamin Huo leads QingCloud Technologies' Architect team and Observability Team. He is the founding member of KubeSphere and the co-author of Fluent Operator, Kube-Events, Notification Manager, OpenFunction, and most recently eBPFConductor. He loves cloud-native technologies especially... Read More →

Yue Bao

Senior Software Engineer, Huawei Cloud Computing Technology Co., Ltd.
Yue Bao is a software engineer at Huawei Cloud. She now works 100% on open source, focusing on lightweight edge for KubeEdge. She is a maintainer of KubeEdge and the tech lead of KubeEdge SIG Release and Node. Before that, Yue worked on Huawei Cloud Intelligent... Read More →

Hongbing Zhang

KubeEdge TSC Member, Chief Operating Officer, DaoCloud
Hongbing Zhang is the Chief Operating Officer of DaoCloud. He is an open source veteran who founded the IBM China Linux team in 2011 and organized the team to make significant contributions to the Linux kernel, OpenStack, and Hadoop projects. Now he is focusing on the cloud native domain and leading... Read More →
Level 19 | Crystal Court II
  Cloud Native Experience
  • Content Experience Level Any
  • Presentation Language Chinese

15:30 HKT

Stability in Large Model Training: Practices in Software and Hardware Fault Self-Healing - Yang Cao, Ant Group
Wednesday June 11, 2025 15:30 - 16:00 HKT
Training trillion-parameter AI models requires significant GPU resources, where any idle time leads to increased costs. Maintaining full-speed GPU utilization is crucial, yet hardware and software failures (such as firmware, kernel, or hardware issues) often disrupt large-scale training. For example, Llama 3 training experienced 419 unexpected interruptions over 54 days, with about 78% attributed to hardware issues, underscoring the necessity for automated anomaly recovery.
We will share our practices at Ant Group:
GPU Monitoring: Comprehensive monitoring from hardware to applications to ensure optimal performance.
Self-Healing for Large GPU Clusters: Automated fault isolation, recovery from kernel panics, and node reprovisioning for clusters with 10,000+ GPUs.
Core Service Level Objectives (SLOs): Achieving over 98% GPU availability and more than 90% automatic fault isolation.
Predictive Maintenance: Using failure pattern analysis to reduce downtime and improve reliability.
Speakers

Yang Cao

Senior Engineer, Ant Group
Yang Cao is a senior engineer at Ant Group, currently focusing on ensuring the stability of large-scale distributed training on Kubernetes.
Level 19 | Crystal Court II
  Cloud Native Experience

16:15 HKT

High-Performance Cloud Native Traffic Authentication Solutions - Muyang Tian & Zhonghu Xu, Huawei
Wednesday June 11, 2025 16:15 - 16:45 HKT
In the rapidly evolving landscape of cloud computing and microservices architecture, efficiently and securely managing communication between services has become a critical challenge. Traditional methods of network traffic authentication often become a performance bottleneck, especially when handling large-scale data flows. This session introduces an innovative solution: using XDP (eXpress Data Path), a Linux kernel technology, to achieve efficient traffic authentication for service-to-service communications.

We will delve into how to use XDP for rapid filtering and processing of packets before they enter the system's protocol stack, significantly reducing latency and enhancing overall system throughput. Additionally, we will share practical application experiences from projects such as Kmesh, including but not limited to performance tuning, security considerations, and integration with other network security strategies.
Speakers

Muyang Tian

Operating System Engineer, Huawei
Operating system engineer at Huawei Technologies Co., Ltd., core member of Kmesh, and contributor to libxdp. Enthusiastic about cloud native technology and eBPF-based high-performance networking.

Zhonghu Xu

Principal Software Engineer, Huawei
Zhonghu is an Istio Steering Committee member, has been a core maintainer of Istio since 2018, and is one of Istio's top 3 contributors. He is also the CNCF TAG-Network Tech Lead and a maintainer of several CNCF projects, including Istio, Kmesh, and Volcano. He is also among the Kubernetes top 100 contributors... Read More →
Level 19 | Crystal Court II
  Security
  • Content Experience Level Any
  • Presentation Language Chinese
 