Machine Learning Infrastructure at Meta Scale

  • Added: 16 August 2023
  • Speaker:
    Shivam Bharuka
    Senior AI Infra Engineer, Meta
    Shivam has been with Meta as part of the AI Infrastructure team for the last three years. During this time, he has helped scale Meta's machine learning training infrastructure to support large-scale ranking and recommendation models serving more than a billion users. He is responsible for driving performance-, reliability-, and efficiency-oriented designs across the components of the ML training stack at Meta.
    Abstract:
    The machine learning models that power ranking at Meta are growing rapidly in scale. To support this growth, we have re-imagined the entire AI infrastructure stack, from building specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms using PyTorch. In this talk, I will discuss the challenges we encountered and the approach we took to redesign and scale the stack.
