The Current and Future State of the Cloud Data Lake

The separation of storage, data, and compute provides greater agility, stability, and speed to insight.

I had the opportunity to attend the virtual Subsurface data lake conference, where Billy Bosworth, CEO, and Tomer Shiran, Co-Founder and Chief Product Officer, of Dremio shared their thoughts on the rise of the cloud data lake.

Billy kicked off the event by sharing how open cloud data lake architectures unlock data to enable access with best-of-breed technologies at the users’ own pace. A cloud data lake enables the insertion of different services, at different layers, at different times. The separation of compute, data, and storage for performance and scale enables an elastic pay-as-you-go model providing access, capability, and agility not previously available.

Tomer went through a brief history: 10 years ago, monolithic architectures were the only option. Data warehouses, on-prem or in the cloud, offered a single compute engine and interface that was proprietary and expensive. On-prem data lakes were fixed stacks dictated by Hadoop, with fixed storage and compute capacity. They lacked flexibility, were monolithic, and were provided by a single vendor.

In the last few years, storage, data, and compute have been separated, with an always-on data tier built on open source and open standards and offered by the large public cloud providers in many different regions. It’s available to every engine, every tool, and every library.

The availability of compute has resulted in a decoupled, elastic, and isolated compute tier. The compute tier runs in a separate place from the data, so what the data scientists are working on is isolated from the compute running the executive dashboards, resulting in greater stability for everyone.

It starts with the storage layer: infinitely scalable, highly available, globally distributed, and easy to use and access. For convenience and agility, ensure the data is as accessible and open as possible. The data tier is built on open source, with standards that are open and efficient.

The compute tier is built on elastic and isolated best-of-breed compute engines: Interactive SQL & BI = Dremio; Spark = Databricks; Batch = EMR; Occasional SQL = Athena; DW Extension = Redshift Spectrum, Snowflake, SQL Data Warehouse/Synapse.

On the interchange tier, Apache Arrow (and Arrow Flight) will remain important for years to come: Arrow provides an efficient in-memory columnar format, and Flight transfers it in parallel. Data scientists can use it to pull data from a distributed system into a Python or R client at speeds up to 100x faster, populating a data frame in a client application.

Tomer sees an open future with separate compute, data, and storage: S3 and ADLS provide an open, scalable, always-available data tier, enabling users to access data in a consistent way with best-in-class compute engines while avoiding the vendor lock-in of monolithic architectures.

Drop Me a Line, Let Me Know What You Think

© 2020 by Tom Smith | @ctsmithiii