Training trillion-parameter AI models requires significant GPU resources, and any idle time drives up costs. Maintaining full-speed GPU utilization is crucial, yet hardware and software failures (such as firmware, kernel, or hardware faults) frequently disrupt large-scale training. For example, LLaMA 3 training experienced 419 interruptions over 54 days, 78% of them due to hardware issues, underscoring the need for automated anomaly recovery. At Ant Group, we will share:

- GPU Monitoring: Comprehensive monitoring from hardware to applications to ensure optimal performance.
- Self-Healing for Large GPU Clusters: Automated fault isolation, recovery from kernel panics, and node reprovisioning for clusters with 10,000+ GPUs.
- Core Service Level Objectives (SLOs): Achieving over 98% GPU availability and more than 90% automatic fault isolation.
- Predictive Maintenance: Using failure pattern analysis to reduce downtime and improve reliability.
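As a rough illustration of the automated fault isolation mentioned above, the sketch below shows one common Kubernetes pattern: when a monitoring agent reports a GPU fault on a node, a controller cordons the node and applies a taint so the scheduler stops placing training pods there until the node is repaired or reprovisioned. The node name, taint key, and fault signal are hypothetical; this is a minimal sketch of the general pattern, not Ant Group's actual implementation.

```go
// Hypothetical sketch: isolate a node with a detected GPU fault by cordoning
// and tainting it, so new training pods are no longer scheduled there.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// isolateFaultyNode marks a node unschedulable and adds a taint describing
// the fault so that downstream remediation workflows can react to it.
func isolateFaultyNode(ctx context.Context, client kubernetes.Interface, nodeName, reason string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %s: %w", nodeName, err)
	}

	// Cordon: prevent new pods from being scheduled onto the node.
	node.Spec.Unschedulable = true

	// Taint: record the fault class (the taint key here is illustrative,
	// not a standard Kubernetes key).
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/gpu-fault",
		Value:  reason,
		Effect: corev1.TaintEffectNoSchedule,
	})

	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("load in-cluster config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("create clientset: %v", err)
	}

	// In a real controller this call would be driven by GPU health signals
	// surfaced by a monitoring agent (e.g. Xid errors or ECC counters),
	// and followed by a repair/reprovision step before uncordoning.
	if err := isolateFaultyNode(context.Background(), client, "gpu-node-042", "xid-error"); err != nil {
		log.Fatalf("isolate node: %v", err)
	}
}
```

In practice, a self-healing pipeline of this kind would pair the taint with a remediation workflow (reboot, firmware reset, or node reprovisioning) and only uncordon the node after it passes health checks, which is how fault isolation feeds into the availability SLOs described above.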
Yang Cao, Senior Engineer, Ant Group
Yang Cao is a senior engineer at Ant Group, currently focusing on ensuring the stability of large-scale distributed training on Kubernetes.