Cloud Data Warehouse vs. Cloud Data Lake

Cloud Data Lakes are growing as data-heavy enterprises take advantage of lower cost, faster time to insight, efficiency, and agility.



I had the opportunity to speak with Jason Nadeau, V.P. of Strategy at Dremio, about what he is seeing with regard to how enterprises are handling big data and analytics today.


For the last few years, we’ve seen enterprises migrating to the cloud and now moving back to a blend of cloud and on-prem with hybrid multi-cloud infrastructure. At the same time, enterprises are dealing with the accelerated three V’s of data - volume, velocity, and variety.


Since COVID-19, we’re seeing more pressure on IT budgets due to economic uncertainty, forcing technology leaders to find ways to accomplish more with less. With data and technology at the heart of the business, it is not possible to simply shut down cloud migrations and data analytics projects. In verticals such as financial services and healthcare, higher volatility and volume are leading to ever-increasing amounts of data that need to be processed and analyzed with fewer resources.


Cloud data warehouses (CDWs) are repositories for structured and filtered data that’s been processed for a specific purpose. Cloud data lakes (CDLs) are vast pools of raw data, the purpose of which is yet to be defined.
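To make that distinction concrete, here is a minimal Python sketch. It is my own illustration, not something Jason or Dremio provided: the events, the field names, and the helper functions `load_warehouse` and `read_lake` are all hypothetical, and an in-memory SQLite table stands in for a real warehouse. The warehouse path fixes a schema before loading and drops anything that doesn't fit; the lake path keeps the raw records and picks fields only when a question comes up.

```python
import json
import sqlite3

# Hypothetical raw events, kept exactly as they arrived (the "lake" side).
RAW_EVENTS = [
    '{"customer": "a", "amount": 12.5, "channel": "web"}',
    '{"customer": "b", "amount": 7.0}',
    '{"customer": "a", "amount": 3.5, "device": "ios"}',
]


def load_warehouse(raw_lines):
    """Warehouse-style: apply a fixed schema up front and drop everything else."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Only the columns chosen in advance survive the load.
        rows.append((record["customer"], record["amount"]))
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def read_lake(raw_lines, fields):
    """Lake-style: keep raw records and choose the fields at read time."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record come back as None.
        yield {name: record.get(name) for name in fields}


warehouse = load_warehouse(RAW_EVENTS)
total_by_customer = dict(
    warehouse.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer")
)

# A question nobody planned for: which channel did each event come from?
# The warehouse table no longer has that field; the raw data still does.
channels = [row["channel"] for row in read_lake(RAW_EVENTS, ["channel"])]
```

The warehouse answers the question it was built for (totals per customer) quickly, while the `channel` field is only recoverable from the raw records.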


CDLs give enterprises a repository to keep all of their data in one place -- a single source of truth. Today, as more data becomes available from more sources, enterprises are complementing their business intelligence with new datasets to provide more and previously undiscovered insights.


CDLs are outperforming CDWs as they enable enterprises to get more value, from more data, more quickly. Time to insight is quicker and less expensive since data can be queried in place rather than being moved out of the CDL, which incurs high egress expense.


Data in a CDL is open to exploration with best-of-breed technologies. Processing engines can be brought to the CDL along with new data sets. This helps to accelerate exploration and insights while keeping costs down.


CDW challenges:

  • Expensive to exfiltrate data

  • Locked-in and hard to migrate off

  • Difficult to take advantage of best-of-breed technology


As an enterprise puts more of its data in a CDW, these challenges become more costly, hindering agility and innovation.


CDL benefits:

  • Complete control of data at all times - easy to access, process, and query in the CDL

  • Lower risk of vendor lock-in

  • Data is accessible by more users

  • Ability to mix-and-match best-of-breed technology


In a CDL, an enterprise can use Amazon S3 and ADLS to store data, Dremio and Databricks to process data, and Tableau and Power BI to visualize data.


The predominant use case Jason and his team are seeing across multiple vertical industries is business intelligence with a consistent semantic layer, where users are able to get dashboards and reports very quickly. Data scientists are able to make queries across a wide range of data. Users are able to make ad hoc queries with traditional SQL access.


While Jason does not see CDLs replacing existing CDWs, he does see a rebalancing of data analysis as enterprises pursue new data projects in CDLs to take advantage of the lower cost, faster time to insight, efficiency, and agility.


If you'd like to learn more about CDLs and become involved in a community of CDL professionals, check out Subsurface Live on July 30. They'll have industry experts sharing their knowledge about the current state of CDL architecture and use cases.


© 2020 by Tom Smith | ctsmithiii@gmail.com | @ctsmithiii