The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon China 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.
Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC+8:00). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.
Distributed databases like OceanBase offer scalability and fault tolerance but can be challenging to manage in Kubernetes. Kubernetes is widely used for managing workloads, but deploying OceanBase on a single cluster creates a single point of failure: if that cluster fails, the entire database may become unavailable, which is problematic in production environments.
This talk will explore how deploying OceanBase across multiple Kubernetes clusters can solve this problem. Distributing the database across clusters ensures high availability and reduces the impact of a cluster failure. It also makes Kubernetes upgrades safer for operations teams.
We’ll cover the challenges of managing distributed databases in Kubernetes, like data consistency and load balancing. We’ll also show how multi-cluster deployments improve stability and resilience, making the solution stronger for critical applications. Attendees will learn how this architecture boosts fault tolerance and simplifies database management.
Peng Wang is the Global Technical Evangelist for OceanBase, a distributed relational database designed for cloud-native applications. He has over a decade of experience in the database industry, including his previous role as a team lead in Intel’s database R&D group. He is currently... Read More →
As AI/ML workloads continue to scale in complexity, developers and platform engineers are pushing Kubernetes beyond typical MLOps boundaries.
This talk dives into strategies for orchestrating GPU-accelerated training and inference across large-scale clusters, integrating HPC principles, operator-based scheduling, and novel debugging workflows.
Attendees will learn how to implement fine-grained GPU partitioning, harness ephemeral containers to probe and adjust multi-node training in real time, and adopt eBPF-driven instrumentation for low-overhead kernel-level performance insights. We’ll explore cutting-edge scheduling optimizations, such as reinforcement-learning approaches and HPC-inspired batch-queuing orchestration on Kubernetes, that dynamically respond to heterogeneous job demands.
Real-world case studies will highlight HPC integration scenarios (RDMA, GPU Direct) for data-parallel workloads and complex training frameworks such as Horovod, Ray, and Spark on Kubernetes.
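The fine-grained GPU partitioning mentioned above can be pictured as a bin-packing problem: fractional GPU requests placed onto whole devices. A minimal, hypothetical sketch (not the speakers' implementation; job names and fractions are invented) using a first-fit policy:

```python
# Illustrative first-fit packing of fractional GPU requests onto devices,
# the core idea behind fine-grained GPU partitioning. All names and
# request sizes are hypothetical.

def place(requests, gpu_count, capacity=1.0):
    """Assign each fractional GPU request to the first device with room.

    Returns a dict mapping device index -> list of placed (name, fraction)
    requests; raises if some request cannot be placed anywhere.
    """
    free = [capacity] * gpu_count
    placement = {i: [] for i in range(gpu_count)}
    for name, frac in requests:
        for i in range(gpu_count):
            if free[i] >= frac:
                free[i] -= frac
                placement[i].append((name, frac))
                break
        else:
            raise RuntimeError(f"no GPU can fit {name} ({frac})")
    return placement

jobs = [("train-a", 0.5), ("infer-b", 0.25), ("train-c", 0.5), ("infer-d", 0.25)]
print(place(jobs, gpu_count=2))
```

Real schedulers add many more signals (topology, memory isolation, preemption), but the packing decision above is the starting point they refine.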
Principal Technical Solutions Architect, Akamai Technologies
Brandon Kang is a Principal Technical Solutions Architect at Akamai Technologies, specializing in cloud-native projects across Asia as a compute specialist. Before joining Akamai, he served as a Lead Software Engineer at Samsung, a Senior Program Manager at Microsoft, and a Service... Read More →
In this Lightning Talk, we’ll dive into K8sGPT, a CNCF sandbox project that uses AI to enhance Kubernetes management. K8sGPT leverages LLMs to diagnose cluster issues, offering root cause analysis and solutions in simple terms. It encodes SRE expertise into analyzers, extracting key insights and enriching them with AI-powered explanations. Key highlights:
- Core Features: Learn to use the CLI and K8sGPT Operator for cluster error analysis and contextualized insights.
- AI Integration & Security: Explore integration with AI models like OpenAI, Azure, and Ollama, with data anonymization for security.
- Real-world Demos: See how K8sGPT simplifies Kubernetes troubleshooting.
- Enterprise Strategies: Discover techniques like LoRA and RAG to tailor K8sGPT for specific environments.
Whether you're new to Kubernetes or an expert, K8sGPT can streamline cluster management, reduce troubleshooting time, and boost efficiency.
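The analyzer pattern described above — extract the key failure signal from a resource, then hand a compact prompt to an LLM — can be sketched in a few lines. This is an illustration of the idea only, with an invented pod structure and a stubbed-out LLM call; K8sGPT's actual analyzers and prompts differ:

```python
# Minimal sketch of an "analyzer": find the failing container state and
# build a plain-language prompt for an LLM backend (the LLM call itself
# is omitted). Pod data below is a hand-written stand-in.

def analyze_pod(pod):
    """Return (failure, prompt) for a pod-like dict, or None if healthy."""
    for cs in pod.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting["reason"] != "ContainerCreating":
            failure = f'{pod["name"]}/{cs["name"]}: {waiting["reason"]}'
            prompt = (
                "Explain this Kubernetes error in simple terms and "
                f"suggest a fix: {waiting['reason']} ({waiting.get('message', '')})"
            )
            return failure, prompt
    return None

pod = {
    "name": "web-7f9c",
    "containerStatuses": [
        {"name": "app",
         "state": {"waiting": {"reason": "ImagePullBackOff",
                               "message": "pull access denied"}}},
    ],
}
print(analyze_pod(pod))
```

The real project adds anonymization before anything leaves the cluster, which is the security point the talk highlights.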
Kay Yan is a kubespray maintainer and containerd/nerdctl maintainer. He is a Principal Software Engineer at DaoCloud and has been developing the DaoCloud Enterprise Kubernetes Platform since 2016.
Service mesh is thriving: new versions incorporate exciting features and significant CVE fixes that bring considerable benefits to users. However, the disruption of service traffic caused by service mesh upgrades or restarts, and the resulting system instability, remains a major obstacle to adopting service mesh in production. In the most mature model, the sidecar model, upgrading the data plane results in the redeployment of services; in some cases this is nearly unacceptable, as certain business applications face substantial cold-start costs. Even in the rising sidecarless mode, interrupting existing user connections must still be addressed, which forces difficult choices. Beginning with real-world case studies, technical experts from Huawei Cloud and Alibaba Cloud will share practical experience with seamless service mesh upgrades achieved with users in real production scenarios.
Hang Yin is a senior engineer at Alibaba Cloud, focusing on Kubernetes, service mesh, and other cloud-native fields. He currently serves on the Alibaba Cloud Service Mesh (ASM) team, responsible for core capabilities of ASM such as performance improvement, the ecosystem, and Mesh Topology.
A Senior Engineer at Huawei Cloud, I specialize in Kubernetes, service mesh, and other cloud-native technologies. I am the primary developer and maintainer of the CNCF project Kmesh and actively contribute to several other CNCF projects, with a particular emphasis on service mesh and... Read More →
- Kubernetes 1.31: moving cgroup v1 support into maintenance mode, making cgroup v2 (kernel 5.8+) a key requirement.
- Linux kernel version requirements of Kubernetes features.
- eBPF and modern networking and observability.
This talk will provide a detailed look at the kernel version requirements for Kubernetes, with a focus on evolving trends in AI infrastructure, SIG-Node, and SIG-Network. We will explore how different kernel versions influence Kubernetes cluster operations, especially in the areas of network performance, resource management, and security enhancements. This session will also highlight some of the rising star projects in the cloud-native ecosystem, including Cilium, Falco, Pyroscope, Kepler and DeepFlow.
Key topics:
- AI infrastructure (device-related)
- Kubernetes SIG-Node (cgroup)
- Kubernetes SIG-Network (nftables)
- eBPF-based project requirements
- Is checking the kernel version enough?
- Dependencies and ecosystem maintenance
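The kind of check the talk advocates — comparing a node's kernel against per-feature minimums — is mechanically simple. A hedged sketch: the cgroup v2 floor (5.8) comes from the abstract above, while the second table entry is a placeholder, not an authoritative requirement.

```python
# Compare a node's kernel release string against per-feature minimum
# versions. The cgroup v2 minimum is from this session's abstract; the
# eBPF entry is an illustrative placeholder.

def kernel_at_least(version: str, minimum: tuple) -> bool:
    """Compare '5.15.0-91-generic' style strings against (major, minor)."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor.split("-")[0])) >= minimum

REQUIRED = {
    "cgroup v2": (5, 8),              # per the Kubernetes 1.31 note above
    "example eBPF feature": (5, 10),  # placeholder value
}

node_kernel = "5.4.0-150-generic"
for feature, minimum in REQUIRED.items():
    ok = kernel_at_least(node_kernel, minimum)
    print(f"{feature}: {'ok' if ok else 'kernel too old'}")
```

As the "Is checking the kernel version enough?" topic suggests, version numbers alone can mislead (distros backport features), so treat this as a first-pass screen rather than a verdict.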
I'm a software developer from DaoCloud, China, and a Kubernetes contributor. Outside work, I'm often active in Kubernetes Networking, including Kube-Proxy, Calico, Cilium, Metallb, and more.
Early-stage tech startups normally aim for low infrastructure running costs alongside fast development and delivery. Once the first few clients are onboard, a disaster recovery (DR) plan becomes a must-have. When DR is required and the agreed RTO is, for example, 6 hours, how do you keep running costs low while still meeting the agreed RTO and SLA? Our DR plan and implementation is a success story to share with the audience. We adopted Kubernetes as our container orchestration platform along with DevOps best practices such as Infrastructure-and-Configuration-as-Code and Pipeline-as-Code. Our DR implementation spends only a minimum on always-on resources. When a DR incident happens, automated pipelines bring up on-demand resources, including a Kubernetes cluster, geo-recover the database and storage, and then deploy the latest applications into the Kubernetes cluster; production DR can be live within 2 hours.
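The RTO arithmetic behind a plan like this is worth making explicit: sum the automated pipeline stages and check the total against the agreed RTO. A back-of-the-envelope sketch; the stage names and durations below are hypothetical, not the speakers' numbers (their abstract states only the 6-hour RTO and ~2-hour result):

```python
# Toy RTO budget: if every recovery stage is automated, the estimated
# recovery time is just the sum of stage durations. Numbers are invented
# for illustration.

RTO_HOURS = 6.0

stages_minutes = {
    "provision Kubernetes cluster": 30,
    "geo-recover database": 45,
    "geo-recover storage": 20,
    "deploy latest applications": 15,
}

total_hours = sum(stages_minutes.values()) / 60
print(f"estimated recovery: {total_hours:.2f}h (RTO {RTO_HOURS}h)")
assert total_hours <= RTO_HOURS
```

Budgeting per stage like this also shows where to spend: the longest stage is the one worth pre-provisioning or parallelizing first.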
As a Senior DevSecOps Engineer at KPMG Australia, I have been leading the cloud operations and security for Origins, a blockchain-based SaaS solution for supply chain traceability, since May 2022. I have brought the best practices of DevSecOps into day-to-day development and delivery... Read More →
Discover how the LF Energy working group is driving innovation in sustainable living with the Open Renewable Energy Systems (ORES) project. This session will explore how ORES leverages cloud-native technologies to build an open architecture, open standards, and APIs for software-defined home energy networks. By embracing Kubernetes and other cloud-native principles, ORES enables seamless integration of renewable energy sources, energy storage, and smart devices for a future-proof, scalable, and sustainable energy ecosystem. Learn how ORES promotes collaboration, interoperability, and innovation to shape the next generation of energy solutions in the cloud-native era.
Chris Xie, Head of Open Source Strategy at Futurewei, is a prominent advocate for global open source collaboration. With a background that includes roles at both Fortune 500 companies and startups, he brings a unique combination of technical and strategic business expertise. Recently... Read More →
Organizations are investing heavily in bringing AI accelerators into their data centers, or using them on the public cloud, yet continue to struggle with cost-effective and efficient management of these critical resources. Existing approaches address this, but they are heavyweight and inflexible. In this session we review whether and how the challenges of expensive and limited machine-learning compute resources such as GPUs can be addressed, and identify solutions for GPU fractional optimization with our technical PoC, GPU.x: a transparent backend Python hook inside upstream ML frameworks running on Kubernetes. It is lightweight, easy, and flexible, requiring no code changes to your AI applications on their path to cloud native.
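The "transparent Python hook" idea — intercepting a framework entry point so calls can be redirected without touching application code — can be illustrated in miniature. The framework and hook below are stand-ins, not GPU.x itself:

```python
# Toy illustration of transparent hooking: wrap a framework function once
# at startup; application code calls it unchanged while the wrapper gets a
# chance to account usage or pick a device slice. The "framework" here is
# a stand-in with a pure-Python matmul.

import types

framework = types.SimpleNamespace(
    matmul=lambda a, b: [[sum(x * y for x, y in zip(row, col))
                          for col in zip(*b)] for row in a])

calls = []

def hook(original):
    def wrapper(a, b):
        calls.append("matmul")   # e.g., meter GPU time or choose a fraction
        return original(a, b)    # then delegate to the real implementation
    return wrapper

framework.matmul = hook(framework.matmul)  # installed once; app code unchanged

# Application code is untouched and unaware of the hook.
result = framework.matmul([[1, 2]], [[3], [4]])
print(result, calls)
```

The appeal of this pattern is exactly what the abstract claims: the interposition happens below the application, so no AI application code changes are needed.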
Tiejun Chen was a senior technical leader. He has worked at several tech companies, such as VMware, Intel, and Wind River Systems, on cloud native, edge computing, ML/AI, WebAssembly, and more. He has given many presentations, including at AI.Dev NA 2023, KubeCon China 2021 & 2024, Kube... Read More →
Kubernetes Isekai (異世界) is an open-source RPG designed for hands-on Kubernetes learning through gamification. Ideal for junior to Higher Diploma students at Hong Kong Institute of Information Technology (HKIIT), it transforms Kubernetes education into an engaging adventure.
- Role-Playing Adventure: Students interact with NPCs who assign Kubernetes tasks.
- Task-Based Learning: Tasks involve setting up and managing Kubernetes clusters.
- Free Access: Uses AWS Academy Learner Lab with Minikube or Kubernetes.
- Scalable Grading: An AWS SAM application tests Kubernetes setups within AWS Lambda.
- Progress Tracking: Students track progress and earn rewards.
- GenAI Chat: Integrates Generative AI to make NPC interactions more dynamic and fun, enhancing the overall learning experience.
This game offers practical Kubernetes experience in a fun, cost-effective way. Demo: https://www.youtube.com/watch?v=dIwNWwz681k
Senior Lecturer, Hong Kong Institute of Information Technology
Cyrus Wong is an accomplished senior lecturer who oversees the Higher Diploma program in Cloud and Data Centre Administration at the Hong Kong Institute of Information Technology (HKIIT) in Hong Kong. He is a passionate advocate for the adoption of cloud technology across various... Read More →
Cloud major student, Hong Kong Institute of Information Technology at IVE (Lee Wai Lee)
I am pursuing a Higher Diploma in Cloud and Data Centre Administration at the Hong Kong Institute of Information Technology at IVE (Lee Wai Lee) and am currently interning at Cathay Pacific Airways. This project teaches Kubernetes concepts and commands in a gamified way. By turning... Read More →
Agentic AI is revolutionizing how we create intelligent agents that can interact with the real world. However, building and deploying these systems often involves significant complexity and time investment. This demo-driven session introduces a cloud-native scaffolding approach, leveraging software templates to streamline and simplify the development of agentic AI projects. This results in a more efficient and developer-friendly experience. Through live demonstrations, attendees will see firsthand how this innovative scaffolding framework accelerates the development lifecycle of agentic AI applications. It provides automated code generation and pre-configured infrastructure. Seamless integration with popular AI libraries reduces overhead and complexity. By the end of the session, participants will have a clear understanding of how to adopt cloud-native scaffolding to revolutionize their development process and gain practical skills to drive innovation in their projects.
Daniel Oh is a Java Champion and Senior Principal Developer Advocate at Red Hat, helping developers build cloud-native apps and serverless on Kubernetes ecosystems. He's also contributing to various cloud open-source projects and ecosystems as a CNCF ambassador for accelerating... Read More →
Search, advertising, and recommendation services are among the primary business types within Xiaohongshu. Because these services depend strongly on index tables, each instance replica needs to maintain its own independent state, so such services are deployed as stateful workloads. With the rapid growth of Xiaohongshu's business scale, the size limit of a single Kubernetes cluster made it impossible to further scale stateful services. To handle daily traffic and business growth, the only option was to migrate workloads to idle clusters; however, this migration approach caused significant inconvenience and risk for developers. To tackle this challenge, Xiaohongshu leveraged Karmada to federate stateful services. By designing scheduling and deployment capabilities for stateful services on federated clusters, this approach seamlessly resolved the scaling limitations imposed by single-cluster capacity constraints.
Sunweixiang previously worked on the Alibaba Cloud container team as a software engineer and is a contributor to the OpenKruise, Karmada, and other communities. He is deeply involved in container application orchestration and multi-cluster management.
Song Yang is a Cloud Native Development Engineer at Xiaohongshu, currently working on multi-cluster and Kubernetes scheduler. He is a maintainer of the CNCF incubating project KubeVela.
Disaggregating the prefill and decoding phases in LLM inference has garnered significant attention in the industry because it can enhance performance. Several solutions have been developed, including Mooncake, TetriInfer, Splitwise, DistServe, and RTP-LLM. However, deploying disaggregated LLM inference at scale on Kubernetes, while evaluating its performance and cost benefits, presents numerous challenges. In this talk, we will introduce a solution that uses a LeaderWorkerSet as the workload, an Ingress Controller, and a node discovery service. It can deploy disaggregated prefill/decode (PD) serving on Kubernetes, supporting multiple LLM inference engines like Mooncake and RTP-LLM with zero intrusion. Furthermore, we will discuss improving load balancing using Envoy and ORCA, based on KVCache and metrics, and recommending optimal ratios for the PD phases. Finally, we will cover essential features for production deployment such as high availability, elastic scaling, canary releases, and observability.
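The KVCache-aware load balancing mentioned above boils down to routing each request to the replica reporting the least cache pressure, in the spirit of ORCA-style load reports consumed by a proxy such as Envoy. A hedged sketch; the field names and utilization numbers are hypothetical, not the speakers' protocol:

```python
# Illustrative endpoint picker: choose the decode replica with the lowest
# reported KVCache utilization. Real deployments combine several metrics
# (queue depth, batch occupancy) and smooth them over time.

def pick_endpoint(endpoints):
    """endpoints: list of dicts with 'name' and 'kv_cache_util' in [0, 1]."""
    return min(endpoints, key=lambda e: e["kv_cache_util"])["name"]

decode_pool = [
    {"name": "decode-0", "kv_cache_util": 0.82},
    {"name": "decode-1", "kv_cache_util": 0.35},
    {"name": "decode-2", "kv_cache_util": 0.57},
]
print(pick_endpoint(decode_pool))
```

Plain round-robin ignores how full each replica's KVCache is, which is why metric-driven selection like this matters for disaggregated PD serving.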
Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor... Read More →
Jing Gu is a senior engineer at Alibaba Cloud. She works on Alibaba Cloud Container Service for Kubernetes, focusing on serving large language models (LLMs) within Kubernetes and optimizing LLM inference processes.
Kata Confidential Containers (CoCo) is a technology that provides hardware-based isolation for containerized workloads. It is built on top of the Kata Containers project, which uses lightweight VMs to provide container isolation. It can disable file system sharing between host nodes and pods, which helps reduce the attack surface. However, this protection limits the use of Persistent Volumes. During this session, we will introduce Kata Confidential Containers and discuss the typical volume-mount workflow of CSI drivers. We will cover the challenges that arise when supporting Kata CoCo in CSI drivers, and explore the solutions we have developed to overcome them in our open-source Azure File CSI driver. By the end of this session, you will have a comprehensive understanding of Kata Confidential Containers and all the necessary details to use them with persistent volumes.
Andy Zhang is the storage lead on the Azure Kubernetes Service team at Microsoft and a maintainer of multiple Kubernetes projects, including the Windows csi-proxy project and the Azure, SMB, NFS, and iSCSI CSI drivers. Andy focuses on improving the experience of using storage in Kuberne... Read More →
The rise of WebAssembly (WASM) has sparked comparisons with Docker which often leads to questions and confusion: Are WASM and Docker competing technologies?
In this talk, we will see how this is far from the truth. On one side, Docker revolutionised how we bundle and deploy applications, offering unparalleled portability and simplifying workflows across environments. On the other, WASM brings speed, security, and efficiency, enabling code written in languages like C, C++, and Rust to run at near-native speed with rapid startup times, even in the browser.
We will explore how these two technologies bring the best of both worlds and help developers achieve portability, efficiency, security, and flexibility. We will also look at how Docker is actively working to make WASM mainstream by allowing WASM container images to be hosted on DockerHub and run WASM containers alongside traditional Linux and Windows containers.
Pradumna is a Developer Advocate, Docker Captain, and a DevOps and Go Developer. He is passionate about Open Source and has mentored hundreds of people to break into the ecosystem. He also creates content on X (formerly Twitter) and LinkedIn, educating others about Open Source and... Read More →
What does the future of AI look like when we push the boundaries of cloud-based models and take it to the edge? In this talk, we’ll explore how Wasm and edge computing power AI deployment by providing developers with a fast, lightweight, and secure framework for running machine learning models across devices.
We’ll focus on how Wasm enables AI models to run efficiently on edge devices, from NVIDIA GPUs to Macs, driving LLM agents that require low latency and high throughput. This session will demonstrate the scalability of Wasm when integrated into distributed systems for AI processing, showing how the combination of edge computing and Wasm allows for faster, more responsive AI applications that don’t rely on centralized cloud resources.
We’ll showcase real-life use cases such as AI streamers commenting in real time and the deployment of video-translation agents. Developers will walk away with an understanding of how to combine Wasm with edge infrastructure to build and deploy AI apps that scale seamlessly.
CNCF Ambassador and Founding Member, WasmEdge
Miley is a Dev Advocate who builds and contributes to open source. She is a co-chair and keynote speaker for KubeCon + Open Source Summit and AI Dev China 2024. With 6 years of experience working on the WasmEdge runtime, a CNCF sandbox project, as a founding member, she has spoken at KubeCon, KCD Shenzhen... Read More →