Artwork

Contenuto fornito da Demetrios Brinkmann. Tutti i contenuti dei podcast, inclusi episodi, grafica e descrizioni dei podcast, vengono caricati e forniti direttamente da Demetrios Brinkmann o dal partner della piattaforma podcast. Se ritieni che qualcuno stia utilizzando la tua opera protetta da copyright senza la tua autorizzazione, puoi seguire la procedura descritta qui https://it.player.fm/legal.
Player FM - App Podcast
Vai offline con l'app Player FM !

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

55:36
 
Condividi
 

Manage episode 415529999 series 3241972
Contenuto fornito da Demetrios Brinkmann. Tutti i contenuti dei podcast, inclusi episodi, grafica e descrizioni dei podcast, vengono caricati e forniti direttamente da Demetrios Brinkmann o dal partner della piattaforma podcast. Se ritieni che qualcuno stia utilizzando la tua opera protetta da copyright senza la tua autorizzazione, puoi seguire la procedura descritta qui https://it.player.fm/legal.

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. Developed & deployed ML models at WEB and Big scale for Ads and Tax. Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/ MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints. // Abstract The talk provides a gentle introduction to the topic of LLM checkpointing: why is it hard, how big are the checkpoints. It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing. // Bio Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax. // MLOps Jobs board https://mlops.pallet.xyz/jobs // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/ Timestamps: [00:00] Simon preferred beverage [01:23] Takeaways [04:22] Simon's tech background [08:42] Zombie models garbage collection [10:52] The road to LLMs [15:09] Trained models Simon worked on [16:26] LLM Checkpoints [20:36] Confidence in AI Training [22:07] Different Checkpoints [25:06] Checkpoint parts [29:05] Slurm vs Kubernetes [30:43] Storage choices lessons [36:02] Paramount components for setup [37:13] Argo workflows [39:49] Kubernetes node troubleshooting [42:35] Cloud virtual machines have pre-installed mentoring [45:41] Fine-tuning [48:16] Storage, networking, and complexity in network design [50:56] Start simple before advanced; consider model needs. [53:58] Join us at our first in-person conference on June 25 all about AI Quality

  continue reading

376 episodi

Artwork
iconCondividi
 
Manage episode 415529999 series 3241972
Contenuto fornito da Demetrios Brinkmann. Tutti i contenuti dei podcast, inclusi episodi, grafica e descrizioni dei podcast, vengono caricati e forniti direttamente da Demetrios Brinkmann o dal partner della piattaforma podcast. Se ritieni che qualcuno stia utilizzando la tua opera protetta da copyright senza la tua autorizzazione, puoi seguire la procedura descritta qui https://it.player.fm/legal.

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. Developed & deployed ML models at WEB and Big scale for Ads and Tax. Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/ MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints. // Abstract The talk provides a gentle introduction to the topic of LLM checkpointing: why is it hard, how big are the checkpoints. It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing. // Bio Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax. // MLOps Jobs board https://mlops.pallet.xyz/jobs // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/ Timestamps: [00:00] Simon preferred beverage [01:23] Takeaways [04:22] Simon's tech background [08:42] Zombie models garbage collection [10:52] The road to LLMs [15:09] Trained models Simon worked on [16:26] LLM Checkpoints [20:36] Confidence in AI Training [22:07] Different Checkpoints [25:06] Checkpoint parts [29:05] Slurm vs Kubernetes [30:43] Storage choices lessons [36:02] Paramount components for setup [37:13] Argo workflows [39:49] Kubernetes node troubleshooting [42:35] Cloud virtual machines have pre-installed mentoring [45:41] Fine-tuning [48:16] Storage, networking, and complexity in network design [50:56] Start simple before advanced; consider model needs. [53:58] Join us at our first in-person conference on June 25 all about AI Quality

  continue reading

376 episodi

Alle Folgen

×
 
Loading …

Benvenuto su Player FM!

Player FM ricerca sul web podcast di alta qualità che tu possa goderti adesso. È la migliore app di podcast e funziona su Android, iPhone e web. Registrati per sincronizzare le iscrizioni su tutti i tuoi dispositivi.

 

Guida rapida