KubeCon + CloudNativeCon China 2025: Full Schedule

10-11 June
Learn More and Register to Attend

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon China 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: All times are shown in Hong Kong Standard Time (UTC+8:00). The schedule is subject to change and session seating is available on a first-come, first-served basis.

09:00 HKT

Keynote: Introductory Remarks - Jim Zemlin, Executive Director, The Linux Foundation

Tuesday June 10, 2025 09:00 - 09:10 HKT

Level 16 | Grand Ballroom I

Speakers

Jim Zemlin

Executive Director, The Linux Foundation

Zemlin’s career spans three of the largest technology trends to rise over the last decade: mobile computing, cloud computing and open source software. Today, as executive director of The Linux Foundation, he uses this experience to accelerate the adoption of Linux and support the...

Keynote Sessions

Content Experience Level Intermediate
Presentation Language English

09:12 HKT

Keynote: Community Opening Remarks - Chris Aniszczyk, CTO, Cloud Native Computing Foundation

Tuesday June 10, 2025 09:12 - 09:22 HKT

Level 16 | Grand Ballroom I

Speakers

Chris Aniszczyk

CTO, CNCF

Chris Aniszczyk is an open source executive and engineer with a passion for building a better world through open collaboration. He's currently a CTO at the Linux Foundation focused on developer relations and running the Open Container Initiative (OCI) / Cloud Native Computing Foundation...

Keynote Sessions

Content Experience Level Intermediate
Presentation Language English

09:24 HKT

Keynote: Crossplane Is the Answer! but What Is the Question? - Amit Dsouza, Odyssey Cloud & Cortney Nickerson, Nirmata

Tuesday June 10, 2025 09:24 - 09:34 HKT

Level 16 | Grand Ballroom I

Why consider Crossplane when so many IaC tools exist—Terraform, Pulumi, CloudFormation, Config Connector, and KRO? What unique challenges does it solve, and is it always the right choice?
Join Cortney & Amit as they explore why Crossplane is gaining traction, not just as an IaC tool but as a Platform Engineering enabler. Learn how Crossplane extends the Kubernetes API to manage both infrastructure and applications declaratively, empowering platform teams.
Beyond provisioning, security and compliance are critical. Discover how the Crossplane + ArgoCD + Kyverno stack enables GitOps-driven automation, ensuring deployments align with organizational compliance and security policies.
Through real-world use cases, we’ll explore:
- Where does Crossplane fit among IaC tools?
- When is Crossplane NOT the right choice?
- How can it enable scalable, self-service platforms?
- How does it integrate with ArgoCD & Kyverno for GitOps and security?

Speakers

Amit DSouza

Co-founder, Odyssey Cloud

Amit Dsouza is an IT professional with over 13 years of experience in the industry. He is a co-founder of Odyssey Cloud, Australia. With experience in Fortune 500 companies & startups, he has worked in various locations including Australia, Singapore, & India. Amit specializes in...

Cortney Nickerson

Head of Community, Nirmata

Cortney is Head of Community at Nirmata. As a CNCF and Civo Ambassador, co-organizer of CNCF Bilbao Community, and speaker and organizing member of various KCD events, she is a recognized voice in the cloud native space. Initially, a non-techie, she turned techie as employee 7 at...

Keynote Sessions, Platform Engineering

Content Experience Level Intermediate
Presentation Language English

09:36 HKT

Sponsored Keynote: Towards Clouds of AI Clusters - Bill Ren, Huawei Chief Open Source Liaison Officer, Board member of CNCF

Tuesday June 10, 2025 09:36 - 09:41 HKT

Level 16 | Grand Ballroom I

AI is quickly becoming the most important workload in our clouds. However, AI is not like other cloud native workloads. Whereas before, clouds could manage elastic resources that easily and cheaply scaled out, AI workloads do not readily support this. AI hardware infrastructure is moving towards large clusters of processors, is not readily scaled out, is not readily available on-demand, and is much more expensive. This requires significant changes to how we build and manage our clouds, from the operating system up to our cloud native infrastructure. This talk will highlight how this evolution towards clouds of AI clusters is happening through projects such as Linux, Volcano, and Karmada.

Speakers

Bill Ren

Chief Open Source Liaison Officer, Board member of CNCF, Huawei

Bill Ren holds an EMBA and a master's degree from Peking University, and a bachelor's degree in CS from Shanghai Jiaotong University. Since joining Huawei in 2000, Bill has served as an Intelligent Network Research and Development Engineer, Product Manager and Architect of India Branch, General...

Keynote Sessions

Content Experience Level Intermediate
Presentation Language English

09:55 HKT

Keynote: Scaling Model Training with Volcano: iFlytek’s Kubernetes Breakthrough - Dong Jiang, Platform Architect, iFlytek & Xuzheng Chang, Software Engineer, Huawei Cloud

Tuesday June 10, 2025 09:55 - 10:00 HKT

Level 16 | Grand Ballroom I

Training massive AI models at scale is tough—but doing it efficiently in Kubernetes is even tougher. In this keynote, we’ll share how iFlytek tackled key challenges in large-scale model training, including low GPU utilization, fragile workflows, and resource contention across teams. By leveraging Volcano, they boosted GPU usage by over 40%, and cut failure recovery time by 70%. This talk offers a quick but powerful look at how intelligent scheduling and orchestration can unlock performance, reliability, and fairness in multi-tenant AI platforms.

Speakers

Xuzheng Chang

Software Engineer, Huawei Cloud

Xuzheng Chang is a maintainer of the Volcano community, with in-depth research and practical experience in the fields of batch computing and cloud-native AI scheduling. Xuzheng has spearheaded several significant features within the Volcano community. Actively contributing to open-source...

Dong Jiang

Platform Architect, iFlytek

Keynote Sessions

Content Experience Level Intermediate
Presentation Language English

10:13 HKT

Keynote: Closing Remarks

Tuesday June 10, 2025 10:13 - 10:15 HKT

Level 16 | Grand Ballroom I

Keynote Sessions, Platform Engineering

Content Experience Level Intermediate
Presentation Language English

11:45 HKT

Defining a Specification for AI/ML Artifacts - Fog Dong, BentoML; Gorkem Ercan, Jozu; Peng Tao & Chlins Zhang, Ant Group; Xudong Wang, Paypal

Tuesday June 10, 2025 11:45 - 12:15 HKT

Level 19 | Crystal Court I

AI has become a prominent figure in the cloud native ecosystem and there continues to be massive adoption in this emerging field. As frameworks and approaches are introduced, a pattern has emerged which threatens the ability to manage at scale: each implementation introduces its own format, runtime, and ways of working, fragmenting the ecosystem. On the other hand, open standards are the backbone of cohesive and scalable ecosystems.

This panel discussion seeks to explore the importance of defining standards within the CNCF ecosystem, particularly focusing on AI/ML artifacts. Beyond the advantages of the standard in facilitating integration with existing cloud native tools, this conversation will delve into how the standards can serve as a foundation for innovation. Join us to understand how standardization with innovative approaches can advance the cloud native AI landscape.

Speakers

Chlins Zhang

Software Engineer, Ant Group

Chenyu Zhang is a software engineer at Ant Group, currently responsible mainly for the development and maintenance of the Harbor project. He also has experience with DevOps and cloud native technology stacks.

Xudong Wang

PayPal

Peng Tao

Staff Engineer, Ant Group

Kata Containers architecture committee member, Nydus maintainer, and Linux kernel developer.

Fog Dong

Senior Software Engineer, BentoML

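Tianxin Dong is currently a senior engineer at BentoML. She is also a core maintainer of KubeVela and a CNCF Ambassador. She is dedicated to building open source communities and works tirelessly to advance open source projects, especially in the cloud native DevOps field. Currently, at BentoML, she...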

Gorkem Ercan

CTO, Jozu

Gorkem Ercan is a co-founder and CTO of Jozu. Gorkem has experience working and leading teams with various technologies ranging from building IDEs, to building mobile phones, and CI/CD systems. He is an avid contributor and supporter of open source and previously served at the Eclipse...

AI + ML

Content Experience Level Intermediate
Presentation Language English

13:45 HKT

Antipatterns in Observability: Lessons Learned and How OpenTelemetry Solves Them - Steve Flanders, Splunk

Tuesday June 10, 2025 13:45 - 14:15 HKT

Level 16 | Grand Ballroom I

Observability is essential, but common antipatterns like over-collecting data, siloed tools, and poorly instrumented code can derail your efforts. This session uncovers the most frequent observability pitfalls and shows how OpenTelemetry addresses these challenges with its standardized approach. From eliminating vendor lock-in to streamlining telemetry pipelines, you’ll gain insights into building a more effective and sustainable observability strategy. Real-world examples will highlight how teams have successfully overcome these antipatterns, empowering you to avoid costly mistakes and maximize OpenTelemetry’s potential.

Speakers

Steve Flanders

Senior Director of Engineering, Splunk

Steve Flanders is a Senior Director of Engineering at Splunk responsible for the Observability Platform team, which includes contributions to the OpenTelemetry project. Previously, he was the Head of Product and Experience at Omnition, which Splunk acquired. Prior to Omnition, he...

Observability

Content Experience Level Intermediate
Presentation Language English

14:30 HKT

More Than Model Sharding: LWS & Distributed Inference - Peter Pan & Nicole Li, DaoCloud

Tuesday June 10, 2025 14:30 - 15:00 HKT

Level 19 | Crystal Court I

Large LLMs such as Llama3.1-405B or DeepSeek-V3 (671B) require distributed inference across multiple nodes, for example with vLLM and a Ray backend.
However, this takes more than model slicing with tensor parallelism. Native Kubernetes treats the workloads on different nodes as unrelated, so several challenges arise:
- standalone StatefulSets without coordination
- the need for gang scheduling
- uncontrolled startup order among master and workers, causing boot lag
- HPA for the group as a whole rather than per StatefulSet, so that the Ray head and workers scale together
- stable index and rank
- topology-aware grouping
- failure recovery for vLLM/PyTorch (not smart enough), so that one pod or GPU failure does not disrupt overall inference

----
LWS, the LeaderWorkerSet (github.com/kubernetes-sigs/lws), is designed to address them:
- optimize resource coordination with a leader-worker set
- improve performance through co-location
- integrate scaling with HPA for the whole LWS together
- provide an all-or-nothing restart policy for fault tolerance as a group

Speakers

Nicole Li

Cloud Native Developer, DaoCloud

Cloud Native Developer, Service Mesh & Istio Contributor, AI Newbie

Peter Pan

R&D Engineering VP, DaoCloud

- DaoCloud Software Engineering VP
- Regular KubeCon Program Committee member: 2023 EU, 2024 HK, 2024 India, 2025 EU
- Regular KubeCon speaker: 2023 SH, 2024 EU, 2024 HK
- Maintainer of the following CNCF projects: cloudtty, kubean, hwameistor
- CNCF WG-AI (AI Working Group) member + CNAI white-paper...

AI + ML

Content Experience Level Intermediate
Presentation Language Chinese

15:44 HKT

⚡ Lightning Talk: AI-Powered Kubernetes Diagnostics With K8sGPT - Kay Yan, DaoCloud

Tuesday June 10, 2025 15:44 - 15:49 HKT

Level 16 | Grand Ballroom I

In this Lightning Talk, we’ll dive into K8sGPT, a CNCF sandbox project that uses AI to enhance Kubernetes management. K8sGPT leverages LLMs to diagnose cluster issues, offering root cause analysis and solutions in simple terms. It encodes SRE expertise into analyzers, extracting key insights and enriching them with AI-powered explanations.
Key highlights:
- Core Features: Learn to use the CLI and K8sGPT Operator for cluster error analysis and contextualized insights.
- AI Integration & Security: Explore integration with AI models like OpenAI, Azure, and Ollama, with data anonymization for security.
- Real-world Demos: See how K8sGPT simplifies Kubernetes troubleshooting.
- Enterprise Strategies: Discover techniques like LoRA and RAG to tailor K8sGPT for specific environments.
Whether you're new to Kubernetes or an expert, K8sGPT can streamline cluster management, reduce troubleshooting time, and boost efficiency.

Speakers

Kay Yan

Principal Software Engineer, DaoCloud

Kay Yan is a maintainer of kubespray and containerd/nerdctl. He is a Principal Software Engineer at DaoCloud and has been developing the DaoCloud Enterprise Kubernetes Platform since 2016.

⚡ Lightning Talks, Operations + Performance

Content Experience Level Intermediate
Presentation Language English

15:51 HKT

⚡ Lightning Talk: Best Practices for Upgrading Service Mesh Seamlessly - Hang Yin, Alibaba Cloud & Zhencheng Lee, Huawei Technologies

Tuesday June 10, 2025 15:51 - 15:56 HKT

Level 16 | Grand Ballroom I

Service Mesh is thriving, with new versions always incorporating exciting features and significant CVE fixes that bring considerable benefits to users. However, the disruption of service traffic caused by Service Mesh upgrades or restarts, leading to system instability, remains a major obstacle to the usage of Service Mesh in production. In the most mature sidecar model, upgrading the data plane of the service mesh results in the redeployment of services; in some cases, this is nearly unacceptable, as certain business applications may face substantial cold start costs. Even in the rising sidecarless mode, it is still necessary to address the issue of interrupting existing user connections, which requires difficult choices. This talk will begin with real-world case studies, in which technical experts from Huawei Cloud and Alibaba Cloud will share practical experience of seamless service mesh upgrades in real production scenarios.

Speakers

Hang Yin

Senior R&D Engineer, Alibaba Cloud

Hang Yin is a senior engineer at Alibaba Cloud, focusing on Kubernetes, service mesh, and other cloud native fields. Hang currently serves on the Alibaba Cloud Service Mesh (ASM) team, responsible for core capabilities of ASM such as performance improvement, ecosystem, and Mesh Topology.

Zhencheng Lee

Senior R&D Engineer, Huawei Cloud, Huawei Technologies Co., Ltd.

Senior Engineer at Huawei Cloud, specializes in Kubernetes, service mesh, and other cloud-native technologies. I am the primary developer and maintainer of the CNCF project Kmesh and actively contribute to several other CNCF projects, with a particular emphasis on service mesh and...

⚡ Lightning Talks, Connectivity

Content Experience Level Intermediate
Presentation Language Chinese

16:05 HKT

⚡ Lightning Talk: Disaster Recovery - How IaCaC and Kubernetes Enables Cost Efficiency and Fast Recovery - Sandy Wang, KPMG Australia

Tuesday June 10, 2025 16:05 - 16:10 HKT

Level 16 | Grand Ballroom I

Early-stage tech startups usually aim for low infrastructure running costs alongside fast development and delivery. Once the first few clients are onboard, a disaster recovery (DR) plan is a must-have. When DR is required and the agreed RTO is, for example, 6 hours, how do you keep running costs low while still meeting the agreed RTO and SLA? Our DR plan and implementation is a success story to share with the audience. We adopted the container orchestration platform Kubernetes and DevOps best practices such as Infrastructure-and-Configuration-as-Code and Pipeline-as-Code. Our DR implementation spends only a minimal amount on always-on resources. When a DR incident happens, automated pipelines bring up on-demand resources, including a Kubernetes cluster, geo-recover the database and storage, and then deploy the latest applications into the Kubernetes cluster, so production DR can be live within 2 hours.

Speakers

Pei (Sandy) Wang

Senior DevSecOps Engineer, KPMG Australia

As a Senior DevSecOps Engineer at KPMG Australia, I have been leading the cloud operations and security for Origins, a blockchain-based SaaS solution for supply chain traceability, since May 2022. I have brought the best practices of DevSecOps into day-to-day development and delivery...

⚡ Lightning Talks, Operations + Performance

Content Experience Level Intermediate
Presentation Language English

16:19 HKT

⚡ Lightning Talk: Dynamic GPU Fraction and Sharing With Cloud Native Principle - Tiejun Chen, Individual Contributor

Tuesday June 10, 2025 16:19 - 16:24 HKT

Level 16 | Grand Ballroom I

Organizations are investing heavily in bringing AI accelerators into their data centers or using them on the public cloud, but they continue to struggle with cost-effective and efficient management of these critical resources. Some existing approaches address this, but they are heavy and inflexible. Here, we'd like to review whether and how we can address the challenges of expensive and limited machine learning compute resources such as GPUs, and identify solutions for GPU fractional optimization with our technical PoC, GPU.x, which uses a transparent backend Python hook within upstream ML frameworks running on Kubernetes. It is lightweight, easy, and flexible, and requires no code changes to your AI applications as they move toward cloud native.

Speakers

Tiejun Chen

Sr. Technical Lead, Individual Contributor

Tiejun Chen was a Sr. technical leader. He has worked at several tech companies, including VMware, Intel, and Wind River Systems, on cloud native, edge computing, ML/AI, WebAssembly, and more. He has given many presentations at AI.Dev NA 2023, KubeCon China 2021 & 2024, Kube...

⚡ Lightning Talks, Emerging + Advanced

Content Experience Level Intermediate
Presentation Language English

16:47 HKT

⚡ Lightning Talk: Mastering Prefill-Decode-Disaggregated Architecture: Solutions and Best Practices in Alibaba Cloud - Jing Gu & Yang Che, Alibaba Cloud

Tuesday June 10, 2025 16:47 - 17:52 HKT

Level 16 | Grand Ballroom I

Disaggregating the prefill and decoding phases in LLM inference has garnered significant attention in the industry because it can enhance performance. Several solutions have been developed, including Mooncake, TetriInfer, Splitwise, DistServe, and RTP-LLM. However, deploying disaggregated LLM inference at scale on Kubernetes while evaluating its performance and cost benefits presents numerous challenges.
In this talk, we will introduce a solution that uses a LeaderWorkerSet as the workload, an Ingress Controller and a node discovery service. It can deploy disaggregated PD on Kubernetes, supporting multiple LLM inference engines like Mooncake and RTP-LLM with zero intrusion. Furthermore, we will discuss improving load balancing using Envoy and ORCA, based on KVCache and metrics, and recommending optimal ratios for the PD phases. Finally, we will cover essential features for production deployment such as high availability, elastic scaling, canary releases, and observability.

Speakers

Yang Che

Senior Software Engineer, Alibaba Cloud

Yang Che is a senior engineer at Alibaba Cloud. He works on the Alibaba Cloud container service team and focuses on Kubernetes and container-related product development. Yang also works on building an elastic machine learning platform on those technologies. He is an active contributor...

Jing Gu

Software Engineer, Alibaba Cloud

Jing Gu is a senior engineer at Alibaba Cloud. She works on Alibaba Cloud Container Service for Kubernetes, focusing on serving large language models (LLMs) within Kubernetes and optimizing LLM inference processes.

⚡ Lightning Talks, AI + ML

Content Experience Level Intermediate
Presentation Language Chinese

16:54 HKT

⚡ Lightning Talk: Kata Confidential Containers Meet Persistent Storage: Overcoming CSI Driver Challenges - Andy Zhang & Archana Choudhary, Microsoft

Tuesday June 10, 2025 16:54 - 16:59 HKT

Level 16 | Grand Ballroom I

Kata Confidential Containers (CoCo) is a technology that provides hardware-based isolation for containerized workloads. It’s built on top of the Kata Containers project, which uses lightweight VMs to provide container isolation. It has the ability to disable file system sharing between host nodes and pods, which helps to reduce attack surfaces. However, such protection ability limits usage of Persistent Volumes. During this session, we will provide an introduction to Kata Confidential Containers and discuss the typical volume mount workflow of CSI drivers. We will cover the challenges that arise when supporting Kata CoCo in CSI drivers. We will explore the solutions we have developed to overcome these challenges and support Kata CoCo in our open source Azure File CSI driver. By the end of this session, you will have a comprehensive understanding of Kata confidential containers and be able to use them with persistent volumes including all the necessary details.

Speakers

Archana Choudhary

Ms, Microsoft

A software engineer who has been exploring cloud-native technologies, particularly focusing on confidential containers over the past several months.

Andy Zhang (OSTC)

Principal Software Engineer, Microsoft

Andy Zhang is the storage lead in Azure Kubernetes Service team at Microsoft, maintainer of multiple Kubernetes projects, including Windows csi-proxy project, Azure CSI drivers, SMB, NFS, iSCSI CSI drivers, etc. Andy focuses on improving the experience of using storage in Kuberne...

⚡ Lightning Talks, Data Processing + Storage

Content Experience Level Intermediate
Presentation Language English

09:00 HKT

Keynote: Welcome Back + Opening Remarks - Keith Chan, Director of Strategic Planning, The Linux Foundation APAC

Wednesday June 11, 2025 09:00 - 09:10 HKT

Level 16 | Grand Ballroom I

Speakers

Keith Chan

Director of Strategic Planning, The Linux Foundation APAC

Keynote Sessions

Content Experience Level Intermediate
Presentation Language English

09:36 HKT

Keynote: Who Owns Your Pod? Observing and Blocking Unwanted Behavior at eBay With eBPF - Jianlin Lv, eBay & Liyi Huang, Isovalent at Cisco

Wednesday June 11, 2025 09:36 - 09:46 HKT

Level 16 | Grand Ballroom I

Kubernetes admins often struggle to understand pod activities, both for regular pods and those with various privileges. This session explores two use cases that highlight why Tetragon, an eBPF-based observability and enforcement tool, matters for pod security:
1. Replacing Auditbeat with Tetragon: Learn how Auditbeat rules were mapped to Tetragon tracing policies, which functionality gaps were identified, and how eBay contributed back to the community.
2. Auditing Container Process Permissions: See how Tetragon helped analyze pod behavior and determine if applications could migrate to more restrictive pod security policies, ensuring adherence to the principle of least privilege.
We also cover deployment challenges, such as integrating with SIEM platforms, resource utilization, and implementing runtime enforcement for unwanted pod behavior. This talk provides practical insights into using Tetragon for observability, policy refinement, and improving overall pod security posture in Kubernetes environments.

Speakers

Jianlin Lv

Senior Linux Kernel Development Engineer, eBay

https://www.linkedin.com/in/jianlin-lv-25650141/

Liyi Huang

Customer Success Architect, Isovalent at Cisco

senior solution architect @isovalent.com

Keynote Sessions, Observability

Content Experience Level Intermediate
Presentation Language Chinese

10:10 HKT

Keynote: Closing Remarks

Wednesday June 11, 2025 10:10 - 10:15 HKT

Level 16 | Grand Ballroom I

Keynote Sessions, Platform Engineering

Content Experience Level Intermediate
Presentation Language English

11:00 HKT

Unified Observability in GRPC: Metrics and Tracing Using OpenTelemetry Plugin - Purnesh Dixit, Google

Wednesday June 11, 2025 11:00 - 11:30 HKT

Level 16 | Grand Ballroom I

gRPC’s performance advantages hinge on minimizing latency, but its binary protocol and streaming capabilities make debugging and monitoring inherently opaque. While distributed tracing identifies bottlenecks, metrics like error rates and throughput are critical for holistic insights. Yet, manual instrumentation for these signals in gRPC is complex, error-prone, and lacks standardization.

In this talk, Purnesh Dixit from the gRPC team unveils the new OpenTelemetry plugin for gRPC, developed by the gRPC team at Google, which provides unified metrics and tracing out-of-the-box to monitor retries, diagnose streaming bottlenecks, and optimize performance without invasive code changes.
1) Client-per-call: Track overall RPC lifecycle (e.g., grpc.client.call.duration).

2) Client-per-call-attempt: Analyze individual retries/hedges (e.g., grpc.client.attempt.duration).

3) Server-instruments: Measure concurrency, request queuing, and stream lifetimes (e.g., grpc.server.call.started).

Speakers

Purnesh Dixit

Purnesh Dixit (gRPC Team, Google), Google

Purnesh is a software engineer on the gRPC team at Google. He is a contributor to the OpenTelemetry support in gRPC-go.

Observability

Content Experience Level Intermediate
Presentation Language English

11:00 HKT

Resilient Multiregion Global Control Planes With Crossplane and K8gb - Yury Tsarev & Steven Borrelli, Upbound

Wednesday June 11, 2025 11:00 - 11:30 HKT

Level 19 | Crystal Court I

Ensuring resilience in control planes is critical for organizations managing infrastructure and applications across multiple regions with Kubernetes. This talk presents a reference architecture for creating a Crossplane-based Global Control Plane, enhanced with k8gb for DNS-based failover and leveraging an Active/Passive setup.
We’ll explore how Crossplane’s declarative infrastructure provisioning integrates with k8gb to build robust, scalable, and resilient multicluster environments. Key takeaways include:

- Architecting resilient multiregion control planes with Active/Passive roles
- Demonstrating failover mechanisms where the Passive control plane transitions to Active during failures
- Strategies for optimizing failover times while maintaining availability

This session will guide attendees through proven methods and real-world challenges of building resilient Global Control Planes, empowering them to manage critical workloads across geographically distributed regions confidently.

Speakers

Steven Borrelli

Principal Solutions Architect, Upbound

Steven is a Principal Solutions Architect for Upbound, where he helps customers adopt Crossplane.

Yury Tsarev

Principal Solutions Architect, Upbound

Yury is an experienced software engineer who strongly focuses on open-source, software quality and distributed systems. As the creator of k8gb (https://www.k8gb.io) and active contributor to the Crossplane ecosystem, he frequently speaks at conferences covering topics such as Control...

Operations + Performance

Content Experience Level Intermediate
Presentation Language English

11:45 HKT

How Bloomberg Creates a Resilient Data Analytics Platform Using Karmada - Michas Szacillo & Ilan Filonenko, Bloomberg

Wednesday June 11, 2025 11:45 - 12:15 HKT

Level 19 | Crystal Court II

Bloomberg’s Data Analytics Platform Engineering team supports a wide range of real-time streaming, large batch ETL, and data exploration use cases by using Apache Flink, Apache Spark, and Trino across multi-cluster Kubernetes. However, deploying and managing these workflows efficiently at scale can be challenging due to varying resource requirements and uptime needs. For stateful applications like Apache Flink, ensuring recovery and state conservation after downtime is especially important.

This session will discuss how Bloomberg uses Karmada, a multi-cluster management system, to deploy and manage Apache Flink. We’ll also explore how Karmada’s capabilities can be expanded to handle additional data analytics workloads, including Apache Spark and Trino. The session will cover the unique requirements and real-life use cases for each, including:

- Resource-aware workload scheduling
- Custom resource requirements and health interpretation
- State conservation during application failover

Speakers

Ilan Filonenko

Engineering Group Lead, Bloomberg

Ilan Filonenko is an Engineering Group Lead focusing on Cloud Native Data Analytics Infrastructure at Bloomberg - where he has designed and implemented distributed systems at both the application and infrastructure level. Previously, Ilan was an engineering consultant and technical...

Michas Szacillo

Tech Lead, Bloomberg L.P.

Michas is a senior software engineer and tech lead on Bloomberg’s Streaming Analytics engineering team. The platform, which is running on Kubernetes, serves as the foundation for many of Bloomberg's data streaming use cases. Michas is also a frequent collaborator to the CNCF community...

Data Processing + Storage

Content Experience Level Intermediate
Presentation Language English

13:45 HKT

Solidigm CSAL Solution Brings Advanced IO Shaping, Caching and Data Placement Into NVIDIA DPU DOCA S - Wayne Gao, Solidigm & Long Chen, NVIDIA

Wednesday June 11, 2025 13:45 - 14:15 HKT

Level 19 | Crystal Court II

CSAL is the Cloud Storage Acceleration Layer for big data and AI. It is an open-source user-mode FTL, cache, and I/O trace component inside SPDK (upstreamed). It is used commercially in Alibaba's cloud storage system.
For details, see https://www.solidigm.com/products/technology/cloud-storage-acceleration-layer-write-shaping-csal.html
Alibaba and Solidigm also jointly published a paper at the top computer systems conference EuroSys 2024: https://dl.acm.org/doi/pdf/10.1145/3627703.3629566
Session Topics:
This session is a joint development with the NVIDIA DPU team and BeeGFS.
1. CSAL leverages DPU DRAM as its write buffer, achieving the best storage latency ever while also guaranteeing data consistency.
2. High-density QLC storage is favored by the AI industry because it saves power and space in AI data centers. A DPU storage solution can achieve the same thing, so combining the two is a great fit.
3. CSAL brings advanced storage I/O shaping, caching, and data placement software into the NVIDIA DPU DOCA storage software service.
4. Sharing and reporting of experimental data from the DPU, CSAL, and BeeGFS.

Speakers

Long Chen

Director, NVIDIA

Responsible for promoting NVIDIA networking for high-speed storage and new application markets in China.

Wayne Gao

Principal Storage Solution Architect, Solidigm

Wayne Gao is a Principal Engineer and storage solution architect who has worked on CSAL from PF to its commercial release at Alibaba. Wayne also led the main development effort to complete the CSAL pmem/DSA and cxl.mem PF, from Intel to Solidigm. Before joining Intel, Wayne had over 20 years of storage... Read More →

Wednesday June 11, 2025 13:45 - 14:15 HKT
Level 19 | Crystal Court II

Data Processing + Storage

Content Experience Level Intermediate
Presentation Language Chinese

13:45 HKT

Progressive Delivery Made Easy With Argo Rollouts - Kevin Dubois, Red Hat

Wednesday June 11, 2025 13:45 - 14:15 HKT

Level 19 | Crystal Court I

You might already be using a CI/CD solution, but are you 100% sure things will roll out without a glitch once you go to production? Unfortunately, differences between testing/staging and production environments are virtually unavoidable. There’s always a risk of unforeseen issues related to your production environment and/or actual load, which can lead to disruptions for your users.

Progressive delivery is the next step after Continuous Delivery to roll out your application in a controlled and automated way so you can verify and test your application *in production* before it becomes fully available to all your user bases.

Embrace GitOps and Progressive Delivery with techniques like blue-green, canary release, shadowing traffic, dark launches and automatic metrics-based rollouts to validate the application in production using Kubernetes and tools like Istio, Prometheus, ArgoCD, and Argo Rollouts.

Come to this session to learn about Progressive Delivery in action using Kubernetes.

Speakers

Kevin Dubois

Senior Principal Developer Advocate, Red Hat

Kevin is a Java Champion, software engineer, author and international speaker with a passion for Open Source, Java, and Cloud Native Development & Deployment practices. He currently works as developer advocate at Red Hat where he gets to enjoy working with Open Source projects and... Read More →

Wednesday June 11, 2025 13:45 - 14:15 HKT
Level 19 | Crystal Court I

Platform Engineering

Content Experience Level Intermediate
Presentation Language English

13:45 HKT

Connecting Dots: Unified Hybrid Multi-Cluster Auth Experience With SPIFFE and Cluster Inventory API - Chen Yu, Microsoft & Jian Zhu, Red Hat

Wednesday June 11, 2025 13:45 - 14:15 HKT

Level 16 | Grand Ballroom I

As the multi-cluster pattern continues to evolve, managing K8s identities, credentials, and permissions for teams and multi-cluster apps, such as Argo and Kueue, has become a hassle, typically involving managing individual service accounts on each cluster and passing credentials around. Such a setup is often scattered, repetitive, and difficult to track or audit, and it may introduce security and ops complications. This is especially true in hybrid environments, where different solutions could be in play across platforms.

This demo presents a solution based on OpenID, SPIFFE/SPIRE, and the Cluster Inventory API from the Multi-Cluster SIG that provides a unified, seamless, and secure auth experience. The solution is facilitated by the CNCF multi-cluster projects OCM and KubeFleet, and it may inspire attendees to leverage open source solutions to eliminate credential sprawl, reduce operational complexity, and enhance security in hybrid cloud environments when setting up teams and applications to access a multi-cluster setup.

Speakers

Chen Yu

Senior Software Engineer, Microsoft

Chen Yu is a senior software engineer at Microsoft with a keen interest in cloud-native computing. He is currently working on Multi-Cluster Kubernetes and contributing to the Fleet project open-sourced by Azure Kubernetes Service.

Jian Zhu

Senior Software Engineer, Red Hat

Zhu Jian is a senior software engineer at Red Hat, a speaker at KubeCon China 2024, and a core contributor to the Open Cluster Management project. Jian enjoys solving multi-cluster workload distribution problems and extending OCM with add-ons.

Wednesday June 11, 2025 13:45 - 14:15 HKT
Level 16 | Grand Ballroom I

Security

Content Experience Level Intermediate
Presentation Language Chinese

14:30 HKT

The Past, the Present, and the Future of Platform Engineering - Mauricio "Salaboy" Salatino, Diagrid & Viktor Farcic, Upbound

Wednesday June 11, 2025 14:30 - 15:00 HKT

Level 19 | Crystal Court I

Do you think platform engineering is too hard? Or is it just a buzzword? Is the CNCF landscape too tricky to visualize? If you’ve been in this industry long enough, you should know that platform engineering has been around for a long time.

Most of us have been trying to build developer platforms for decades, and most of us have failed at that. That begs the questions: “What is different now?” “Why will this time be different?” and “Do we have a chance to succeed?”

We’ll take a look at the past, the present, and the future of platform engineering. We’ll see what we were doing in the past, what we did wrong, and why we failed. Further on, we’ll see what we (the industry as a whole) are doing now and, more importantly, where we might go from here.

Get ready for the hard truths and challenges you will face when trying to build a platform based on Kubernetes. Join us for a pain-infused journey filled with challenges teams will face when building platforms to enable other teams.

Speakers

Viktor Farcic

Viktor Farcic, Upbound

Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.

Mauricio Salatino

Software Engineer, Diagrid

Mauricio works as an Open Source Software Engineer at @Diagrid, contributing to and driving initiatives for the Dapr OSS project. Mauricio also serves as a Steering Committee member for the Knative Project and co-leads the Knative Functions initiative. He published a book titled... Read More →

Wednesday June 11, 2025 14:30 - 15:00 HKT
Level 19 | Crystal Court I

Platform Engineering

Content Experience Level Intermediate
Presentation Language English

15:30 HKT

Stability in Large Model Training: Practices in Software and Hardware Fault Self-Healing - Yang Cao, Ant Group

Wednesday June 11, 2025 15:30 - 16:00 HKT

Level 19 | Crystal Court II

Training trillion-parameter AI models requires significant GPU resources, and any idle time increases costs. Maintaining full-speed GPU utilization is crucial, yet hardware and software failures (such as firmware, kernel, or hardware issues) often disrupt large-scale training. For example, LLaMA3 training experienced 419 unexpected interruptions over 54 days, with about 78% attributed to hardware issues, underscoring the necessity for automated anomaly recovery.
At Ant Group, we will share:
- GPU Monitoring: Comprehensive monitoring from hardware to applications to ensure optimal performance.
- Self-Healing for Large GPU Clusters: Automated fault isolation, recovery from kernel panics, and node reprovisioning for clusters with 10,000+ GPUs.
- Core Service Level Objectives (SLOs): Achieving over 98% GPU availability and more than 90% automatic fault isolation.
- Predictive Maintenance: Using failure pattern analysis to reduce downtime and improve reliability.

Speakers

Yang Cao

Senior Engineer, Ant Group

Yang Cao is a senior engineer at Ant Group, currently focusing on ensuring the stability of large-scale distributed training on Kubernetes.

Wednesday June 11, 2025 15:30 - 16:00 HKT
Level 19 | Crystal Court II

Cloud Native Experience

Content Experience Level Intermediate
Presentation Language Chinese

15:30 HKT

Composable Platforms: Modular Platform Engineering With Kratix and Backstage - Hossein Salahi, Liquid Reply

Wednesday June 11, 2025 15:30 - 16:00 HKT

Level 19 | Crystal Court I

Constructing and managing platforms for diverse teams and workloads presents a significant challenge in today's cloud-native environment. This session introduces the concept of composable platforms, using modular, reusable components as the foundation for platform engineering. This talk will demonstrate how using Kratix, a workload-centric framework, and Backstage, an extensible developer portal, enables the creation of self-service platforms that balance standardization with adaptability.

The session will detail platform design for scalability and governance, streamlining developer workflows through Backstage, and using Kratix Promises for varied workload requirements. Attendees will gain practical insights into building scalable and maintainable platforms through real-world examples, architectural patterns, and a live demonstration of a fully integrated Kratix-Backstage deployment.

Speakers

Hossein Salahi

Tech Lead, Liquid Reply

Hossein is an experienced cloud computing professional with nearly a decade of expertise in distributed systems and cloud technologies. He began as a student specializing in cloud automation and progressed to a full-time role focusing on on-premises cloud infrastructure and containers... Read More →

Wednesday June 11, 2025 15:30 - 16:00 HKT
Level 19 | Crystal Court I

Platform Engineering

Content Experience Level Intermediate
Presentation Language English