10-11 June

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon China 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is automatically displayed in Hong Kong Standard Time (UTC+8:00). To see the schedule in your preferred timezone, please select from the drop-down menu to the right, above "Filter by Date." The schedule is subject to change and session seating is available on a first-come, first-served basis.
Tuesday June 10, 2025 13:45 - 14:15 HKT
End-to-end large model training is crucial for advancing autonomous driving technology. Horizon Robotics leads in this field by leveraging deep learning algorithms and chip design. They efficiently train and deploy advanced perception models like Sparse4D using cloud-native technologies.
Training these models poses challenges: managing massive video data and numerous small files, ensuring high-performance training with over 2000 GPUs on RDMA, quickly identifying different kinds of failures, and diagnosing issues in large-scale training.
This session covers how Horizon Robotics manages large-scale training on Kubernetes. It highlights the role of distributed data caching, network topology awareness, and job affinity scheduling in optimizing a 2000-GPU training job. We'll also discuss strategies for restoring interrupted training jobs through backup machine replacement to enhance task resilience, and share experiences with CNCF projects like Volcano, Fluid, and NPD.
Speakers
Zhihao Xu

Software Engineer, Alibaba Cloud
Zhihao Xu is a software engineer at Alibaba Cloud focusing on infrastructure for AI model training and large-scale model inference. He is also a maintainer of the CNCF sandbox project Fluid, which is designed for data orchestration for data-intensive applications running...
Chen Yangxue

Software Engineer, Horizon Robotics
I'm Chen Yangxue, a software engineer at Horizon Robotics. With years of cloud-native experience, I'm building a ten-thousand-card training platform with a hybrid cloud setup. I've used tools like Kubernetes, Volcano, etc., to solve tough technical problems. I know how to optimize...
Level 19 | Crystal Court I
  AI + ML
  • Content Experience Level Any
  • Presentation Language Chinese
