Let your data be your guide.
I had the opportunity to hear Dan Weeks, engineering manager of the Big Data Compute team at Netflix, speak at the Subsurface cloud data lake conference. Dan talked about how Netflix has evolved its data infrastructure and storage as data volumes, and the tools used to extract insights from them, have grown.
Netflix has been operating a cloud-based data platform for the last 10 years. Their architectural principles for operating in the cloud are: 1) separate compute and storage, both in data centers and in the cloud; 2) build the data warehouse as a single source of truth for their data; 3) be cloud-native, leveraging elasticity and cloud services to enhance the data platform.
Their file system replicates HDFS behaviors using S3 features. To make that work, they: 1) achieve atomicity by modifying the metastore and compute engines so that only a minimum of atomic operations is required; 2) achieve consistency by tracking file-level metadata to ensure correctness; 3) enhance committers to address the semantic differences in behavior; 4) modify file formats and engines to provide consistent behavior; 5) scale metadata by optimizing metadata queries and behaviors for large datasets; 6) integrate by expanding support for access from other systems and platforms; and 7) maintain the system by rebasing changes across all releases and platforms.
Looking ahead, they saw continued exposure to complexity and implementation details, the need to force the adaptation of technology, and continued high maintenance costs. This led them to rethink the entire storage layer three years ago. They felt it was necessary to come up with a strategic solution to address these challenges.
The result was Iceberg, an open table format for huge analytic datasets. It addresses many of the concerns that had been arising over time. It is an open community standard with a specification that ensures compatibility between languages and implementations.
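To make the idea of a table format with atomic commits more concrete, here is a minimal sketch in Python. It is not Iceberg's implementation or API. The names `Snapshot`, `ToyTable`, and `commit_append` are hypothetical and exist only for this illustration. The sketch assumes a table is a pointer to an immutable snapshot (a list of data files), and that a write succeeds only if the pointer has not moved since the writer read it.

```python
import threading
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: an id plus the data files it contains."""
    snapshot_id: int
    files: Tuple[str, ...]


class ToyTable:
    """Hypothetical in-memory table; the lock stands in for an atomic pointer swap."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = Snapshot(snapshot_id=0, files=())

    def current(self) -> Snapshot:
        return self._current

    def commit_append(self, base: Snapshot, new_files: Tuple[str, ...]) -> bool:
        """Try to publish base + new_files; fail if another commit landed first."""
        candidate = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        with self._lock:
            if self._current is not base:
                return False  # stale base: the caller must re-read and retry
            self._current = candidate
            return True
```

Readers in this sketch always see either the old snapshot or the new one, never a half-written state, which is the kind of behavior the atomicity and consistency work described above is aiming for.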
Here are the storage principles. First, separate user and infrastructure concerns: users are the ones most affected when working with the data, and they do not want to work on infrastructure. Second, have strong contracts for data and behaviors: a strict understanding of the schema, how it can evolve, and which data types are supported. Third, no surprises or side effects: things behave as you expect them to. This is key for people to have confidence in their data warehouse.
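A strong contract around schema evolution can be pictured as a check that runs before a schema change is accepted. The sketch below is an assumption-laden illustration, not something shown in the talk: the function `check_evolution`, the type names, and the rule set (adding columns is allowed, removing columns is flagged, and only the widenings int to long and float to double are allowed) are all hypothetical choices made for the example.

```python
from typing import Dict, List

# Assumed rule set for this sketch: only these type widenings are considered safe.
SAFE_PROMOTIONS = {("int", "long"), ("float", "double")}


def check_evolution(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return a list of contract violations when moving from old to new schema.

    Schemas are plain {column_name: type_name} mappings. New columns are allowed.
    """
    problems: List[str] = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"column removed: {name}")
        elif new[name] != old_type and (old_type, new[name]) not in SAFE_PROMOTIONS:
            problems.append(f"unsafe type change: {name} {old_type} -> {new[name]}")
    return problems
```

An empty result means the change respects the contract. Anything else is a surprise that users would otherwise discover only when their queries break.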
Here’s how Netflix is thinking about building out its data warehouse and the service layer around it.
Compute: 1) integration with many different execution engines; 2) metadata exposed as tables; 3) streaming support for sources and sinks.
Access: 1) a Java client for applications and lower-level integration; 2) a Python client for data science and machine learning; 3) a data access service for generalized data access.
Janitors: 1) TTL janitors for managing dataset lifetimes; 2) snapshot janitors for version history and time; 3) bucket janitors for deleting dangling files.
Tuning: 1) relocating data across multiple regions; 2) compacting tables to optimize files; 3) restating data for better performance.
Analysis: 1) optimizing datasets based on use and audience; 2) modeling changes to the parameters of the datasets; 3) applying configuration changes to tables.
All of this serves the goal of a platform that keeps itself optimized automatically.
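As one illustration of the janitor idea, a TTL janitor can be thought of as a periodic job that selects data older than a dataset's configured lifetime. The following is a minimal sketch under that assumption. `Partition` and `expired_partitions` are hypothetical names, and the sketch only selects candidates; it deliberately does not delete anything.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Partition:
    """Hypothetical unit of data with a creation timestamp."""
    name: str
    created_at: datetime


def expired_partitions(
    partitions: Iterable[Partition], ttl: timedelta, now: datetime
) -> List[str]:
    """Return names of partitions created strictly before now - ttl."""
    cutoff = now - ttl
    return [p.name for p in partitions if p.created_at < cutoff]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the selection logic deterministic and easy to test.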
Data is the foundation of a data warehouse or platform. If you don’t have a good foundation, its problems will leak into all the systems and services that revolve around it, and those will be difficult to manage, maintain, and scale.
You want strong contracts and open standards for integrating internally and with vendor products to scale into the future as technologies and solutions change.