Training Foundation Models on a Trillion Data Points with Apache Iceberg
Training an AI foundation model on over a trillion data points sounds impossible without hitting your production systems. Here's how Datadog did it with Apache Iceberg for their time series forecasting model TOTO. The key challenge: extracting massive historical observability data (metrics spanning years) and running incremental preprocessing pipelines without overwhelming production services. Iceberg solved this by providing schema governance, consistency guarantees, and seamless integration with ML tools like Ray and PyTorch.