Scaling AI Infrastructure at OpenAI
In the last year, OpenAI trained a team of five neural networks to defeat the reigning world champion esports team, achieved state-of-the-art results on a variety of domain-specific language modeling tasks, released a public demonstration of combining multiple musical styles using unsupervised learning, and more.
_x000D_To deliver on these results, OpenAI operates a wide range of complex infrastructures, including some of the largest Kubernetes clusters in the world. Many of the workloads running on this infrastructure don’t adhere to commonly accepted practices or ways of operating software.
_x000D_This talk covers the infrastructure patterns and techniques OpenAI uses to successfully scale their research. Audience members will come away with ideas and practices to help with their own ML/AI teams—in both development and deployment, from research through to production.