The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon China 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.
Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC+8:00). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.
Sign up or log in to add sessions to your schedule and sync them to your phone or calendar.
Please note we are unable to store any items overnight, and cameras, laptops, or other electronic devices cannot be stored in the cloakroom at any time.
As China's leading video platform, Bilibili faces four key challenges in multi-cluster AI workload management:
1. Workload diversity: training, inference, and video-processing workloads have different scheduling requirements.
2. Cross-cluster complexity: managing workloads across multiple Kubernetes clusters in expanding IDCs, under SLAs.
3. Performance demands: minimal startup latency and maximal scheduling efficiency for short-running tasks, e.g. video processing.
4. Efficiency-QoS balance: maximizing resource utilization while ensuring the stability of priority workloads.
This talk will share those experiences and delve into specific optimization techniques:
1. Leveraging and optimizing CNCF projects such as Karmada and Volcano to build a unified, high-performance AI workload scheduling platform.
2. Integrating technologies such as KubeRay to schedule a variety of online and offline AI workloads.
3. Maximizing resource efficiency through online/offline hybrid scheduling, tidal scheduling, and other techniques.
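The tidal scheduling mentioned above can be sketched with a toy model (an illustration of the general idea, not Bilibili's implementation): the offline quota for each hour is whatever capacity the online-load forecast leaves free, minus a safety buffer.

```python
# Toy illustration of tidal scheduling: offline (batch) workloads get the
# headroom that the online-load forecast leaves unused each hour.
# Sketch of the general idea only, not Bilibili's actual implementation.

def tidal_offline_quota(capacity, online_forecast, headroom=0.1):
    """Per-hour offline quota: capacity minus forecast online load,
    minus a safety buffer of `headroom` * capacity."""
    quota = []
    for online in online_forecast:
        free = capacity - online - capacity * headroom
        quota.append(max(0, round(free)))
    return quota

# 1000-core cluster; online load peaks during the day, dips at night.
forecast = [300, 250, 200, 600, 850, 900, 700, 400]
print(tidal_offline_quota(1000, forecast))  # [600, 650, 700, 300, 50, 0, 200, 500]
```

During the 900-core online peak the offline quota drops to zero; at night the batch tide comes back in.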
Technical Expert, Lead of Cloud Native Open Source, Huawei
Kevin Wang has been an outstanding contributor in the CNCF community since its beginning and leads the cloud native open source team at Huawei. Kevin has contributed critical enhancements to Kubernetes and led the incubation of the KubeEdge, Volcano, and Karmada projects in CNCF...
Long Xu is a Senior Software Engineer in the Infrastructure Department at Bilibili. He has rich experience in the Kubernetes field, including scheduling, autoscaling, and system stability.
When we started CNCF in 2015 to help advance container technology, Kubernetes was the seed technology, providing a de facto container orchestration platform for all cloud native applications. Almost a decade later, the community has exploded, with 200+ open source projects building on top of cloud native technologies. Looking ahead, what challenges will we face in the next decade? What gaps remain for users and contributors? And how do we evolve to meet the demands of an increasingly complex and connected world?
Let us review some of the key CNCF projects of today and lay out possible avenues for where cloud native is headed in the next decade: AI, agentic networks, sustainability, and beyond.
Lin is the Head of Open Source at Solo.io, and a CNCF TOC member and ambassador. She has worked on the Istio service mesh since the beginning of the project in 2017 and serves on the Istio Steering Committee and Technical Oversight Committee. Previously, she was a Senior Technical...
Kubernetes admins often struggle to understand pod activities, both for regular pods and for pods with various privileges. This session explores two use cases that highlight why eBay chose Tetragon, an eBPF-based observability and enforcement tool, for pod security:
1. Replacing Auditbeat with Tetragon: learn how Auditbeat rules were mapped to Tetragon tracing policies, which functionality gaps were identified, and how eBay contributed back to the community.
2. Auditing container process permissions: see how Tetragon helped analyze pod behavior and determine whether applications could migrate to more restrictive pod security policies, ensuring adherence to the principle of least privilege.
We also cover deployment challenges, such as integrating with SIEM platforms, resource utilization, and implementing runtime enforcement against unwanted pod behavior. This talk provides practical insights into using Tetragon for observability, policy refinement, and improving the overall pod security posture in Kubernetes environments.
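The rule-to-policy mapping can be sketched as follows. The generated structure loosely mirrors Tetragon's TracingPolicy CRD, but the exact field names should be treated as approximate, and the mapping function itself is hypothetical, not eBay's tooling.

```python
# Hypothetical sketch: translate an Auditbeat-style syscall audit rule into a
# Tetragon-style tracing policy document. Field names loosely follow the
# TracingPolicy CRD but are illustrative, not authoritative.

def auditbeat_rule_to_policy(name, syscall, binaries):
    """Build a TracingPolicy-like dict that traces `syscall` when invoked
    by any of the listed binaries."""
    return {
        "apiVersion": "cilium.io/v1alpha1",
        "kind": "TracingPolicy",
        "metadata": {"name": name},
        "spec": {
            "kprobes": [{
                "call": syscall,
                "syscall": True,
                "selectors": [{
                    "matchBinaries": [{"operator": "In", "values": binaries}],
                }],
            }],
        },
    }

policy = auditbeat_rule_to_policy("audit-exec", "sys_execve", ["/usr/bin/curl"])
print(policy["spec"]["kprobes"][0]["call"])  # sys_execve
```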
While SaaS AI providers like OpenAI offer convenient LLM services, they come with significant drawbacks: high costs, lack of customization, lack of privacy, and usage limitations that can throttle high-volume applications.
This presentation shows how a leading e-commerce website deployed a highly customized suite of LLM applications on private cloud infrastructure, reducing costs by 90% while maintaining complete control over scalability and quality of service. We'll discuss the technology stack for orchestrating inference workloads on cloud GPUs and explore practical strategies for building stable, scalable, high-performance AI apps on your own private cloud infrastructure.
Lv Yi is the CTO of 5miles, a leading e-commerce platform in the United States. With 19 years in IT, he is a cloud native enthusiast who previously served as a mobile business expert at AsiaInfo. In 2012, he led Zhangyue's systems evolution toward microservices architecture. At 5miles...
Vivian Hu is a Product Manager at Second State and a columnist at InfoQ. She is a founding member of the WasmEdge project. She organizes Rust and WebAssembly community events in Asia.
With the growing demand for heterogeneous computing power, Chinese users are gradually adopting domestic GPUs, especially for inference. vLLM, the most popular open-source inference project, has drawn widespread attention but does not support domestic chips, and Chinese inference engines are still maturing in functionality, performance, and ecosystem. In this session, we'll introduce how to adapt vLLM to support domestic GPUs, enabling acceleration features like PagedAttention, Continuous Batching, and Chunked Prefill. We'll also cover performance bottleneck analysis and chip operator development to maximize hardware potential. Additionally, Kubernetes has become the standard for container orchestration and the preferred platform for inference services. We'll show how to deploy the adapted vLLM engine on Kubernetes with a few lines of code using the open-source llmaz project, and explore how llmaz handles heterogeneous GPU scheduling, along with our practices for monitoring and elastic scaling.
Senior Software Engineer, China Mobile (Suzhou) Software Technology Co., Ltd.
The author has rich experience in cloud-native and AI inference development. He currently works at China Mobile, focusing on research and development of cloud-native and AI inference products. He has shared service mesh experience at technical conferences such as the...
Kante is a senior software engineer and open source enthusiast at DaoCloud; his work mostly involves scheduling, resource management, and LLM inference. He actively contributes to upstream Kubernetes as a SIG Scheduling maintainer and helps incubate several projects, such as Kueue...
Sponsor: Akamai
Demo: Unleash AI apps with edge-native speed on Akamai Cloud
Booth Number: G5
In order to facilitate networking and business relationships at the event, you may choose to visit a third party’s booth or access sponsored content. You are never required to visit third-party booths or to access sponsored content. When visiting a booth or participating in sponsored activities, the third party will receive some of your registration data. This data includes your first name, last name, title, company, address, email, answers to standard demographics questions (e.g., job function, industry), and details about the sponsored content or resources you interacted with. If you choose to interact with a booth or access sponsored content, you are explicitly consenting to receipt and use of such data by the third-party recipients, which will be subject to their own privacy policies.
Whether you’re looking to expand your knowledge, connect with experts, or just enjoy a break, the Solutions Showcase is the place to be:
- Exhibits: Visit our sponsor booths to learn about the latest technologies and services.
- CNCF Project Tables: Interact with project maintainers and gain insights into community engagement.
- Attendee T-Shirt Pick-up: Grab your free conference t-shirt.
- Coffee + Tea, Snacks, Lunch Pick-up: Enjoy delicious treats served in the Solutions Showcase.
Imagine your cloud-native applications as a bustling city. To ensure everything runs smoothly, you need to test its resilience by introducing controlled chaos, like planned roadblocks, to spot and fix weaknesses before they cause real trouble.
Join the LitmusChaos team, the folks behind this CNCF Incubating project, as they share the latest and greatest in chaos engineering. They'll walk you through new features from recent updates, like better resilience testing, improved observability, and scalability tools, all designed to tackle the real-world problems developers and SREs face daily.
You'll also get the inside scoop on the project's growth, how the community is shaping its future, and a sneak peek at what's coming next to make chaos engineering easier and more effective.
Sayan Mondal is a Senior Software Engineer II at Harness, building their Chaos Engineering platform and helping them shape the customer experience market. He's the maintainer of a few open-source libraries and is also a maintainer and community manager of LitmusChaos (the Incubating...
gRPC’s performance advantages hinge on minimizing latency, but its binary protocol and streaming capabilities make debugging and monitoring inherently opaque. While distributed tracing identifies bottlenecks, metrics like error rates and throughput are critical for holistic insights. Yet, manual instrumentation for these signals in gRPC is complex, error-prone, and lacks standardization.
In this talk, Purnesh Dixit unveils the new OpenTelemetry plugin for gRPC, developed by the gRPC team at Google, which provides unified metrics and tracing out of the box to monitor retries, diagnose streaming bottlenecks, and optimize performance without invasive code changes. For example, per-call client metrics track the overall RPC lifecycle (e.g., grpc.client.call.duration).
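A per-call duration metric like grpc.client.call.duration ultimately boils down to timing each RPC and recording the result in a histogram. Below is a minimal stdlib-only sketch of that idea; the real plugin hooks into gRPC's channel internals rather than wrapping calls like this.

```python
import time
from collections import defaultdict

# Minimal sketch of a client-side call-duration metric, in the spirit of
# grpc.client.call.duration. Illustrative only: the actual gRPC OpenTelemetry
# plugin records this inside the channel machinery.

class DurationHistogram:
    """Bucketed histogram of durations in seconds."""
    def __init__(self, bounds=(0.005, 0.05, 0.5, 5.0)):
        self.bounds = bounds
        self.buckets = defaultdict(int)

    def record(self, seconds):
        for bound in self.bounds:
            if seconds <= bound:
                self.buckets[bound] += 1
                return
        self.buckets[float("inf")] += 1  # overflow bucket

def timed_call(hist, rpc, *args):
    """Invoke `rpc`, recording its wall-clock duration even if it raises."""
    start = time.perf_counter()
    try:
        return rpc(*args)
    finally:
        hist.record(time.perf_counter() - start)

hist = DurationHistogram()
print(timed_call(hist, lambda x: x * 2, 21))  # 42
print(sum(hist.buckets.values()))             # 1
```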
Ensuring resilience in control planes is critical for organizations managing infrastructure and applications across multiple regions with Kubernetes. This talk presents a reference architecture for creating a Crossplane-based Global Control Plane, enhanced with k8gb for DNS-based failover and leveraging an Active/Passive setup. We’ll explore how Crossplane’s declarative infrastructure provisioning integrates with k8gb to build robust, scalable, and resilient multicluster environments. Key takeaways include:
- Architecting resilient multiregion control planes with Active/Passive roles
- Demonstrating failover mechanisms where the Passive control plane transitions to Active during failures
- Strategies for optimizing failover times while maintaining availability
This session will guide attendees through proven methods and real-world challenges of building resilient Global Control Planes, empowering them to manage critical workloads across geographically distributed regions confidently.
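The Active/Passive failover described above can be sketched as a toy model: health-check the Active control plane and, on failure, repoint the shared DNS name at the Passive one, promoting it to Active. This illustrates the pattern only, not k8gb's actual logic.

```python
# Toy model of DNS-based Active/Passive failover in the spirit of k8gb.
# Illustrative sketch, not the real reconciliation logic.

def resolve(dns_table, name, health):
    """Return the endpoint `name` should point at, failing over if the
    Active control plane is unhealthy."""
    active, passive = dns_table[name]
    if health.get(active, False):
        return active
    # Failover: Passive is promoted to Active, the old Active is demoted.
    dns_table[name] = (passive, active)
    return passive

table = {"api.example.com": ("cp-eu", "cp-us")}
health = {"cp-eu": False, "cp-us": True}
print(resolve(table, "api.example.com", health))  # cp-us
```

After the failover, `table` records `cp-us` as the new Active, so later resolutions keep landing on the promoted control plane.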
Yury is an experienced software engineer who strongly focuses on open-source, software quality and distributed systems. As the creator of k8gb (https://www.k8gb.io) and active contributor to the Crossplane ecosystem, he frequently speaks at conferences covering topics such as Control...
Peer Group Mentoring allows participants to meet with experienced open source veterans across many CNCF projects. Mentees are paired with 2 – 10 other people in a pod-like setting to explore technical, community, and career questions together.
Bloomberg’s Data Analytics Platform Engineering team supports a wide range of real-time streaming, large batch ETL, and data exploration use cases using Apache Flink, Apache Spark, and Trino across multi-cluster Kubernetes. However, deploying and managing these workflows efficiently at scale can be challenging due to varying resource requirements and uptime needs. For stateful applications like Apache Flink, ensuring recovery and state conservation after downtime is especially important.
This session will discuss how Bloomberg uses Karmada, a multi-cluster management system, to deploy and manage Apache Flink. We’ll also explore how Karmada’s capabilities can be expanded to handle additional data analytics workloads, including Apache Spark and Trino. The session will cover the unique requirements and real-life use-cases for each, including:
- Resource-aware workload scheduling
- Custom resource requirements and health interpretation
- State conservation during application failover
Ilan Filonenko is an Engineering Group Lead focusing on Cloud Native Data Analytics Infrastructure at Bloomberg - where he has designed and implemented distributed systems at both the application and infrastructure level. Previously, Ilan was an engineering consultant and technical...
Michas is a senior software engineer and tech lead on Bloomberg’s Streaming Analytics engineering team. The platform, which runs on Kubernetes, serves as the foundation for many of Bloomberg's data streaming use cases. Michas is also a frequent contributor to the CNCF community...
In this session, KubeEdge project maintainers will provide an overview of KubeEdge's architecture and its industry-specific use cases. The session will begin with a brief introduction to edge computing and its growing importance in IoT and distributed systems. The maintainers will then delve into the core components and architecture of KubeEdge, demonstrating how it extends Kubernetes' capabilities to manage edge computing workloads efficiently. They will share success stories and insights from organizations that have deployed KubeEdge in various edge environments, such as smart cities, industrial IoT, edge AI, robotics, and retail, highlighting the tangible benefits and transformational possibilities. Additionally, the session will introduce the certified KubeEdge conformance test, hardware test, KubeEdge course and certification, discuss advancements in technology and community governance within the KubeEdge project, and share the latest updates on the project's graduation status.
Yue Bao serves as a software engineer at Huawei Cloud. She now works 100% on open source, focusing on lightweight edge for KubeEdge. She is a maintainer of KubeEdge and the tech lead of KubeEdge SIG Release and Node. Before that, Yue worked on Huawei Cloud Intelligent...
Hongbing Zhang is Chief Operating Officer of DaoCloud. He is a veteran of the open source world: he founded the IBM China Linux team in 2011 and organized the team to make significant contributions to the Linux kernel, OpenStack, and Hadoop projects. He is now focusing on the cloud native domain and leading...
As large language model (LLM) applications are widely deployed, their complex architectures challenge business observability. APM probes, which rely on instrumentation or proxies, consume system resources and impact traffic and performance, restricting their use in complex scenarios. In addition, with multiple teams handling different LLM instances, it is hard to coordinate unified observability. To solve this, China Mobile's Panji platform collaborates with DeepFlow to achieve zero-intrusion (Zero Code), full-stack (Full Stack) observability instantly, using eBPF and Wasm technologies. eBPF collects real-time data at the kernel level, while Wasm plugins parse streaming requests. By integrating existing data, the platform provides a universal service map, distributed tracing, and multi-dimensional metric analysis, ensuring the stability and performance optimization of LLM applications.
Dr. Shang Jing, Chief Expert at China Mobile Group, has over 20 years of experience in IT system development, construction, and operation. Specializing in big data and cloud technologies, she led the development of China Mobile's Wutong Big Data Platform. Under her leadership, the...
Starting from graduate school at Huazhong University of Science and Technology in 2013, I joined the Tencent Cloud virtual network team in 2016, which gave me in-depth theoretical knowledge and practical experience in cloud networks. In 2018, I joined YUNSHAN Networks as PM...
Not everything can be thought through while designing or developing an application, so many design decisions are based on estimates and expected usage patterns.
More often than not, these estimates differ from reality and introduce inefficiencies across several fronts; if these inefficiencies are visible at all, it is usually much later in the lifecycle, when you already have several customers and a large footprint.
Hence, unless there is a clear sign of performance degradation or unjustified cost, there is often no incentive to invest time and effort for unknown gains.
In this session, Yash will outline a real-world case study of how they built an internal platform, driven by a wide range of metrics, to handle several post-deployment challenges:
1. rightsizing opportunities,
2. architecture migrations, such as moving to serverless,
3. finding the right maintenance windows, etc.,
and how impactful these minor optimizations turned out to be.
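A metrics-driven rightsizing recommendation of the kind mentioned can be sketched as sizing a container's CPU request to a high percentile of observed usage plus a safety margin. This is a generic illustration, not the internal platform from the talk.

```python
# Toy rightsizing recommendation from utilization metrics: request = a high
# percentile of observed CPU usage, padded by a safety margin.
# Generic sketch of the approach, not the platform described in the session.

def p_quantile(samples, q):
    """Nearest-rank quantile (simplistic on purpose; fine for a sketch)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]

def recommend_cpu_request(usage_millicores, quantile=0.95, margin=1.15):
    """Recommend a CPU request from observed usage samples (millicores)."""
    return round(p_quantile(usage_millicores, quantile) * margin)

usage = [120, 150, 180, 200, 230, 210, 160, 900]  # millicores, one spike
print(recommend_cpu_request(usage))  # 1035: with few samples, p95 is the spike
```

In practice the quantile and margin are tuning knobs: a lower quantile ignores rare spikes and saves cost, at the price of occasional throttling.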
Yash works at Google as a Software Engineer and has 9 years of industry experience with cloud architectures and microservice development across Google and VMware. He has spoken at several international conferences, such as KubeCon + CloudNativeCon and Open Source...
In today's tech landscape, AI drives industry transformation, but enterprises face challenges in AI adoption: diverse hardware, complex workflows, and data privacy. OPEA, an open-source enterprise AI platform built from modular microservices, offers a unified solution for rapid deployment. Through a DeepSeek inference appliance case study, see how OPEA integrates with IT infrastructure, optimizes performance, and enhances reliability. Discover the new "Powered by OPEA" certification for confident AI deployment.
CSAL, the Cloud Storage Acceleration Layer for big data and AI, is an open-source user-mode FTL, cache, and I/O-trace component upstreamed into SPDK. It is used commercially in Alibaba's cloud storage system (see https://www.solidigm.com/products/technology/cloud-storage-acceleration-layer-write-shaping-csal.html), and Alibaba and Solidigm jointly published a paper on it at EuroSys 2024 (https://dl.acm.org/doi/pdf/10.1145/3627703.3629566). This session covers joint development with the NVIDIA DPU team and BeeGFS:
1. CSAL leverages DPU DRAM as its write buffer, achieving the best storage latency yet while guaranteeing data consistency.
2. High-density QLC storage is favored by the AI industry because it saves power and space in AI data centers; DPU storage solutions achieve the same, so combining the two is a natural fit.
3. CSAL brings advanced storage I/O shaping, caching, and data placement software into the NVIDIA DPU DOCA storage software service.
4. Experimental data and a report on DPU + CSAL + BeeGFS will be shared.
Wayne Gao is a Principal Engineer and storage solution architect who worked on CSAL from PF to the Alibaba commercial release. Wayne was also the main developer of the CSAL pmem/DSA and cxl.mem PF efforts from Intel to Solidigm. Before joining Intel, Wayne had over 20 years of storage...
I will share the progress of the Ingress-NGINX project in this topic, as well as our newly incubated project, Ingate. Ingate is a project we created to actively adopt the Gateway API, and we will explore the next steps in the Ingate project based on the successes and failures we've experienced in the Ingress-NGINX project, along with user demands for frequently used features.
CNCF Ambassador, Kubernetes Ingress-NGINX maintainer, Kong Inc.
Jintao Zhang is a Microsoft MVP, CNCF Ambassador, Apache PMC member, and Kubernetes Ingress-NGINX maintainer. He is skilled in cloud-native technology and the Azure technology stack.
You might already be using a CI/CD solution, but are you 100% sure things will roll out without a glitch once you go to production? Unfortunately, differences between testing/staging and production environments are virtually unavoidable. There is always a risk of unforeseen issues related to your production environment and/or actual load, which can lead to disruptions for your users.
Progressive delivery is the next step after Continuous Delivery: roll out your application in a controlled and automated way so you can verify and test it *in production* before it becomes fully available to your entire user base.
Embrace GitOps and Progressive Delivery with techniques like blue-green, canary release, shadowing traffic, dark launches and automatic metrics-based rollouts to validate the application in production using Kubernetes and tools like Istio, Prometheus, ArgoCD, and Argo Rollouts.
Come to this session to learn about Progressive Delivery in action using Kubernetes.
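The automatic metrics-based rollouts mentioned above can be sketched as a loop that steps up the canary's traffic weight while the observed error rate stays under a threshold, and rolls back the moment it does not. This illustrates the pattern only, not Argo Rollouts' implementation.

```python
# Toy metrics-based canary rollout in the spirit of Argo Rollouts: increase
# the canary traffic weight step by step, aborting on bad metrics.
# Illustrative sketch, not the real controller logic.

def run_canary(error_rate_at, steps=(5, 25, 50, 100), threshold=0.01):
    """`error_rate_at(weight)` returns the observed error rate while the
    canary serves `weight`% of traffic. Promote on success, else roll back."""
    weight = 0
    for step in steps:
        if error_rate_at(step) > threshold:
            return 0, "rolled-back"  # abort: all traffic back to stable
        weight = step
    return weight, "promoted"

# Healthy canary: error rate stays low at every traffic step.
print(run_canary(lambda w: 0.002))                    # (100, 'promoted')
# Bad canary: errors appear once it receives real traffic.
print(run_canary(lambda w: 0.2 if w >= 25 else 0.0))  # (0, 'rolled-back')
```

In a real setup `error_rate_at` would be a Prometheus query evaluated during an analysis window, and the weight shifts would be applied through the mesh (e.g. Istio traffic splitting).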
Kevin is a Java Champion, software engineer, author and international speaker with a passion for Open Source, Java, and Cloud Native Development & Deployment practices. He currently works as developer advocate at Red Hat where he gets to enjoy working with Open Source projects and...
As the multi-cluster pattern continues to evolve, managing K8s identities, credentials, and permissions for teams and multi-cluster apps, such as Argo and Kueue, has become a hassle, typically involving managing individual service accounts on each cluster and passing credentials around. Such setup is often scattered, repetitive, difficult to track/audit, and may impose security and ops complications. This is especially true with hybrid environments, where different solutions could be in play across platforms.
This demo presents a solution based on OpenID, SPIFFE/SPIRE, and the Cluster Inventory API from the Multi-Cluster SIG that provides a unified, seamless, and secure auth experience. Built on the CNCF multi-cluster projects OCM and KubeFleet, it can inspire attendees to leverage open source solutions to eliminate credential sprawl, reduce operational complexity, and enhance security in hybrid cloud environments when setting up teams and applications to access a multi-cluster setup.
Chen Yu is a senior software engineer at Microsoft with a keen interest in cloud-native computing. He is currently working on Multi-Cluster Kubernetes and contributing to the Fleet project open-sourced by Azure Kubernetes Service.
Zhu Jian is a senior software engineer at RedHat, a speaker at Kubecon China 2024, and a core contributor to the open cluster management project. Jian enjoys solving multi-cluster workload distribution problems and extending OCM with add-ons.
Strong communities foster a feeling of belonging by providing opportunities for interaction, collaboration, and shared experiences. We hope to do just that with a gathering of attendees who identify as women and non-binary individuals at KubeCon + CloudNativeCon China! Join fellow women community members for networking and connection.
As AI tackles increasingly complex tasks, traditional LLMs show limitations in action decision-making and multi-step reasoning, making autonomous planning and dynamic correction key challenges. ZTE's Co-Sight agent system addresses this with a multi-agent (Plan-Actor) collaborative architecture. Its dual-level design separates planning (task decomposition, path generation) from execution, significantly reducing LLM search space. Dynamic task adjustment is achieved via DAG parallel thinking, dynamic context, guardrails, and hierarchical reflection. Co-Sight has demonstrated excellent performance on the GAIA benchmark, particularly showcasing superior stability in complex Level 2 multi-step tasks.
Recently, the health of open-source projects, particularly vendor diversity and neutrality, has become a key topic of discussion. Many projects have faced challenges due to a lack of vendor diversity, threatening their sustainability. It is increasingly clear that setting up the right governance structure and project team during a project’s growth is critical. KubeEdge, the industry's first cloud-native open-source edge computing project, has grown from its initial launch in 2018 to achieving CNCF graduation this year. Over the past few years, KubeEdge has evolved from a small project into a diverse, collaborative, multi-vendor open-source community. In this panel, we will discuss the lessons learned from the KubeEdge community's graduation journey, focusing on key strategies in technical planning, community governance, developer growth, and project maintenance. Join us to explore how to build a multi-vendor, diverse community and how to expand into different industries.
Huan is an open source enthusiast and cloud native technology advocate. He is currently a CNCF ambassador and a TSC member of the KubeEdge project, and serves as a senior technical director at HarmonyCloud.
KubeEdge TSC Member, Senior Software Engineer at Huawei Cloud. Focusing on Cloud Native, Kubernetes, Service Mesh, Edge Computing, Edge AI, and other fields. Currently maintaining the KubeEdge project, a CNCF graduated project, with rich experience in Cloud Native and Edge Computing...
KubeSphere founding member, KubeEdge TSC member, Director of Cloud Platform, QingCloud Technologies
Benjamin Huo leads QingCloud Technologies' Architect team and Observability Team. He is the founding member of KubeSphere and the co-author of Fluent Operator, Kube-Events, Notification Manager, OpenFunction, and most recently eBPFConductor. He loves cloud-native technologies especially...
Yue Bao serves as a software engineer at Huawei Cloud. She now works 100% on open source, focusing on lightweight edge for KubeEdge. She is a maintainer of KubeEdge and the tech lead of KubeEdge SIG Release and Node. Before that, Yue worked on Huawei Cloud Intelligent...
Hongbing Zhang is Chief Operating Officer of DaoCloud. He is a veteran of the open source world: he founded the IBM China Linux team in 2011 and organized the team to make significant contributions to the Linux kernel, OpenStack, and Hadoop projects. He is now focusing on the cloud native domain and leading...
Kubespray, recognized by Kubernetes' SIG Cluster Lifecycle, deploys production-ready Kubernetes clusters on bare metal, enhancing performance for AI applications with robust GPU support. This session covers Kubespray's fundamentals, key features, and updates.
As AI workloads like LLMs grow, scalable GPU clusters are essential. Engineers will share insights from deploying custom GPU clusters at scale with Kubespray, discussing challenges and best practices. Attendees will learn to integrate Kubernetes technologies like LWS, Kueue, Gateway API Inference Extension, DRA, and tensor parallelism to enhance AI workloads like RAG and LoRA, improving resource utilization and performance.
We'll share Kubespray inventory source code for customizing AI clusters and show how to use Kubernetes operators to define infrastructure in private clouds, enabling efficient cluster scaling.
Rong is a software engineer at vivo developing platform services on top of Kubernetes and providing containerized infrastructure, focusing on the closed-loop system of scheduling, GPU technology, networking, and cluster management.
Kay Yan is a kubespray maintainer and a containerd/nerdctl maintainer. He is a Principal Software Engineer at DaoCloud and has developed the DaoCloud Enterprise Kubernetes Platform since 2016.
Do you think platform engineering is too hard? Or is it just a buzzword? Is the CNCF landscape too tricky to visualize? If you’ve been in this industry long enough, you should know that platform engineering has been around for a long time.
Most of us have been trying to build developer platforms for decades, and most of us have failed at that. That begs the questions: “What is different now?” “Why will this time be different?” and “Do we have a chance to succeed?”
We’ll take a look at the past, the present, and the future of platform engineering. We’ll see what we were doing in the past, what we did wrong, and why we failed. Further on, we’ll see what we (the industry as a whole) are doing now and, more importantly, where we might go from here.
Get ready for the hard truths and challenges you will face when trying to build a platform based on Kubernetes. Join us for a pain-infused journey filled with challenges teams will face when building platforms to enable other teams.
Viktor Farcic is a lead rapscallion at Upbound, a member of the CNCF Ambassadors, Google Developer Experts, CDF Ambassadors, and GitHub Stars groups, and a published author. He is a host of the YouTube channel DevOps Toolkit and a co-host of DevOps Paradox.
Mauricio works as an Open Source Software Engineer at @Diagrid, contributing to and driving initiatives for the Dapr OSS project. He also serves as a Steering Committee member for the Knative project and co-leads the Knative Functions initiative. He published a book titled...
Maximizing security in multi-tenant clusters while maintaining cost-effectiveness is crucial for enterprise OPS. Most enterprise clusters deploy multiple daemonsets, which are attractive targets for attackers seeking to escape and move laterally, ultimately taking over the entire cluster.
The SIG community has introduced several advanced security features recently, such as CRD Field Selectors, Field and Label Selector Authorization, validating admission policy (VAP), and Structured Authorization Config. These allow users to define more flexible authorization configurations, addressing filtering and authorization needs for CRDs, kubelet, and other resources in multi-tenant environments.
We will share lessons learned from node escape incidents, demonstrate how to implement these new features, and show how to use the Common Expression Language (CEL) to configure customized policies in the Authorization Webhook and VAP, resulting in more node-specific restrictions within clusters.
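In a real ValidatingAdmissionPolicy the rule is written in CEL; the node-restriction idea can be illustrated in plain Python. The CEL expression in the comment below is a hypothetical example, not one taken from the talk.

```python
# Node-restriction idea from the talk, emulated in Python. In a real
# ValidatingAdmissionPolicy the rule would be a CEL expression such as
#   request.userInfo.username == "system:node:" + object.spec.nodeName
# (hypothetical example). Here we check the same invariant directly:
# a kubelet may only touch objects bound to its own node.

def allow(request_user, obj):
    """Admit the request only if the requesting kubelet's identity matches
    the node the object is bound to."""
    node = obj.get("spec", {}).get("nodeName")
    return node is not None and request_user == "system:node:" + node

pod = {"spec": {"nodeName": "node-a"}}
print(allow("system:node:node-a", pod))  # True: kubelet touches its own node
print(allow("system:node:node-b", pod))  # False: lateral movement blocked
```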
Dahu Kuang is a Security Tech Lead on the Alibaba Cloud Container Service for Kubernetes (ACK) team, focusing on the design and implementation of container security-related work, especially within the context of secure supply chain.
Cheng Gao, Senior Security Engineer at Alibaba Cloud, focuses on the Security Development Lifecycle (SDL) for cloud-native applications. With expertise in container services, observability, and Serverless architectures, Cheng has led security assurance for several internal container...
With the development of AI technology, the computing power demanded by large model training has accelerated the build-out of AI infrastructure. Data centers often hit a "resource wall" between AI acceleration hardware of different generations and from different manufacturers, causing software and hardware stack incompatibilities. Maximizing resource utilization is therefore a big challenge for AI infra operators. This talk focuses on technical solutions for collaborative training using chips of different architectures, sharing practices for solving key problems such as heterogeneous training task splitting, heterogeneous training performance prediction, and heterogeneous hybrid communication. The project has been open sourced and will mature further through the community.
Join this interactive session for a brief overview of the Cloud Native Computing Foundation (CNCF) Technical Oversight Committee (TOC), including recent initiatives and opportunities to get involved. Learn how the TOC is helping shape the next decade of cloud native technologies. Following the overview, we’ll open the floor to your questions, whether they’re technical or about building leadership within CNCF. Initial seeding questions include:
What are some of the latest Cloud Native AI initiatives?
How can we encourage more CNCF and TAG contributions from Asian countries?
What are the possible paths to becoming a CNCF TOC member?
Technical Expert, Lead of Cloud Native Open Source, Huawei
Kevin Wang has been an outstanding contributor in the CNCF community since its beginning and is the leader of the cloud native open source team at Huawei. Kevin has contributed critical enhancements to Kubernetes, led the incubation of the KubeEdge, Volcano, Karmada projects in CNCF... Read More →
Lin is the Head of Open Source at Solo.io, and a CNCF TOC member and ambassador. She has worked on the Istio service mesh since the beginning of the project in 2017 and serves on the Istio Steering Committee and Technical Oversight Committee. Previously, she was a Senior Technical... Read More →
Chris Aniszczyk is an open source executive and engineer with a passion for building a better world through open collaboration. He's currently a CTO at the Linux Foundation focused on developer relations and running the Open Container Initiative (OCI) / Cloud Native Computing Foundation... Read More →
Training trillion-parameter AI models requires significant GPU resources, where any idle time leads to increased costs. Maintaining full-speed GPU utilization is crucial, yet hardware and software failures (such as firmware, kernel, or hardware issues) often disrupt large-scale training. For example, LLaMA3 experienced 419 interruptions over 54 days, with 78% due to hardware issues, underscoring the necessity of automated anomaly recovery. At Ant Group, we will share:
GPU Monitoring: Comprehensive monitoring from hardware to applications to ensure optimal performance.
Self-Healing for Large GPU Clusters: Automated fault isolation, recovery from kernel panics, and node reprovisioning for clusters with 10,000+ GPUs.
Core Service Level Objectives (SLOs): Achieving over 98% GPU availability and more than 90% automatic fault isolation.
Predictive Maintenance: Using failure pattern analysis to reduce downtime and improve reliability.
Senior Engineer, Ant Group
Yang Cao is a senior engineer at Ant Group, currently focusing on ensuring the stability of large-scale distributed training on Kubernetes.
When you're new to Kubernetes, Policy as Code (PaC) can be a very unfamiliar topic. But as you become more familiar with Kubernetes, you'll likely become interested in using it securely. Since Kubernetes is essentially a declarative, YAML-driven system, expressing security as code as well improves usability and reduces human error.
To make PaC easier to understand, I'll demonstrate the admission control layer directly in Kubernetes. Until recently this layer was webhook-based, but since v1.23 Kubernetes has actively embraced the Common Expression Language (CEL), making it possible to apply policy as code directly inside the cluster. Validating Admission Policy became GA in v1.30, and Mutating Admission Policy is Alpha in v1.32.
Based on this outline, I'll talk about how PaC has been applied to Kubernetes in the past, how it works today, and finally, how we can expect it to be integrated into Kubernetes in the future.
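To make the webhook-versus-CEL contrast concrete, here is a minimal sketch (my own illustration, not material from the talk) of a Validating Admission Policy and the binding that activates it; unlike a webhook, no out-of-cluster endpoint is involved, and the API server evaluates the CEL expression in-process.

```yaml
# Minimal illustrative pair: the policy holds the CEL rule, the binding
# activates it cluster-wide. Resource and label names are examples only.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-team-label
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "has(object.metadata.labels) && 'team' in object.metadata.labels"
      message: "all deployments must carry a 'team' label"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-team-label-binding
spec:
  policyName: require-team-label
  validationActions: ["Deny"]
```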
Hoon Jo is a Cloud Solutions Architect and cloud native engineer at Megazone. He has spoken many times on cloud native technologies and works to spread cloud native ubiquitously around the world. He has written several books, the latest of which is 『CONTAINER INFRASTRUCTURE... Read More →
Constructing and managing platforms for diverse teams and workloads presents a significant challenge in today's cloud-native environment. This session introduces the concept of composable platforms, using modular, reusable components as the foundation for platform engineering. This talk will demonstrate how using Kratix, a workload-centric framework, and Backstage, an extensible developer portal, enables the creation of self-service platforms that balance standardization with adaptability.
The session will detail platform design for scalability and governance, streamlining developer workflows through Backstage, and using Kratix Promises for varied workload requirements. Attendees will gain practical insights into building scalable and maintainable platforms through real-world examples, architectural patterns, and a live demonstration of a fully integrated Kratix-Backstage deployment.
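For orientation, a Kratix Promise is itself a Kubernetes custom resource. The fragment below is a heavily abbreviated sketch with placeholder contents; treat the Kratix documentation, not this sketch, as authoritative for the actual schema.

```yaml
# Abbreviated, illustrative shape of a Kratix Promise: it publishes an API
# for a platform capability and wires workflows that fulfil each request.
# Field contents here are placeholders; see the Kratix docs for the real schema.
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: jupyter-notebook
spec:
  api: {}        # placeholder: the CRD offered to application teams
  workflows: {}  # placeholder: pipelines that run when a team requests the resource
```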
Hossein is an experienced cloud computing professional with nearly a decade of expertise in distributed systems and cloud technologies. He began as a student specializing in cloud automation and progressed to a full-time role focusing on on-premises cloud infrastructure and containers... Read More →
For AI developers on Kubernetes, whether working in Jupyter notebooks or serving LLMs, Python dependency management is a constant headache:
- Prepare a set of base images? The maintenance burden becomes a nightmare, since (1) packages in the AI world bump versions rapidly, and (2) different LLM codebases require different permutations and combinations of packages.
- Leave users to `pip install` on their own? The resulting waiting blocks productivity and efficiency, as anyone who has done it will agree.
- On a GPU cloud, package preparation can be costly in itself: you rent a GPU, then waste it waiting for pip downloads.
- You may choose to DIY and docker-commit your own base images, but then you have to manage the Dockerfile, a registry, and additional cloud costs if you lack a local Docker environment.
To address this, we introduce https://github.com/BaizeAI/dataset.
The solution: 1. A CRD to describe the dependencies and environment. 2. A Kubernetes Job to pre-load the packages. 3. A PVC to store and mount them. 4. `conda` to switch between environments. 5. Sharing across namespaces.
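The five steps above could plausibly be driven by a custom resource along these lines. Note this is a hypothetical sketch: the apiVersion, kind, and every field name are invented for illustration, so consult the linked repository for the real schema.

```yaml
# Hypothetical custom resource illustrating steps 1-5 above. All names
# are invented; see https://github.com/BaizeAI/dataset for the real CRD.
apiVersion: example.io/v1alpha1
kind: CondaEnvironment
metadata:
  name: llm-serving
  namespace: ai-team
spec:
  pythonVersion: "3.11"
  pipPackages:            # dependencies declared once, pre-loaded by a Job
    - vllm
    - transformers
  storage:
    pvcName: conda-envs   # packages stored on a PVC, mounted by workloads
  shareWithNamespaces:    # the same environment reused across namespaces
    - notebooks
```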
Cloud native developer, AI researcher, and Gopher with 5 years of experience across many development fields, including AI, data science, backend, and frontend. Co-founder of https://github.com/nolebase
In the AI era, enterprises need to collect more data to build high-quality AI applications, including structured data (databases, data warehouses, etc.) and unstructured data (data lakes, document libraries, real-time data, etc.). Data integrity and compliance play a key role in building AI applications, and this is where metadata delivers its value. Providing AI users with a unified data view (covering data discovery, data semantics, data lineage, data permissions, and more) so they can better discover and use multi-source heterogeneous data, and managing the data life cycle in line with enterprise governance requirements to avoid resource waste and security issues, have become pressing needs for every enterprise.
Apache Gravitino provides a unified API to access multiple data sources and data storages, supports multiple data engines and machine learning frameworks, and implements unified naming, permissions, lineage, auditing, and other functions on top of unified metadata, thereby greatly simplifying data operations and breaking down data silos. It has already been adopted by companies such as Xiaomi, Bilibili, Pinterest, and Uber, with good results. This session will introduce the background, architecture, core functions, and use cases of Gravitino.
In the rapidly evolving landscape of cloud computing and microservices architecture, efficiently and securely managing communication between services has become a critical challenge. Traditional methods of network traffic authentication often become a performance bottleneck, especially when handling large-scale data flows. This session introduces an innovative solution — leveraging Linux kernel technology XDP (eXpress Data Path) to achieve efficient traffic authentication for service-to-service communications.
We will delve into how to use XDP for rapid filtering and processing of packets before they enter the system's protocol stack, significantly reducing latency and enhancing overall system throughput. Additionally, we will share practical application experiences from projects such as Kmesh, including but not limited to performance tuning, security considerations, and integration with other network security strategies.
Operating system engineer at Huawei Technologies Co., Ltd., core member of Kmesh, and contributor to libxdp. Enthusiastic about cloud native technology and eBPF-based high-performance networking.