BI Directly on Cloud Data Lakes Reduces Data Warehouse Costs

Low-latency query technology accelerates BI dashboard queries directly on Amazon S3 and Azure data lake storage.



I had the opportunity to meet with Tomer Shiran, Co-founder and Chief Product Officer at Dremio, to discuss the latest updates to the company's data lake engine. I last heard Tomer speak this summer, when he discussed the rise of the cloud data lake at Dremio's cloud data lake conference.


According to Tomer, these improvements have been a year in the making and are timely in the new work-from-home environment, where both people and compute have moved out of the office. The new innovations deliver sub-second query response times on cloud data lakes and support thousands of concurrent users and queries. In addition, Dremio now includes built-in integration with Microsoft Power BI, enabling users to launch the data visualization software directly from Dremio and immediately start querying data via a direct connection.

The latest Dremio product release enables companies to run production BI workloads, including interactive dashboards, directly on Amazon S3 and Azure Data Lake Storage (ADLS) — without having to move data into data warehouses, cubes, aggregation tables, or extracts. The new capabilities deliver simple, self-service access to data and enable analysts to see results immediately, eliminating their dependency on manual ETL processes or data engineering while reducing the costs associated with data warehousing.

“The fact that organizations don't need to copy their data into a data warehouse for BI workloads has been unthinkable for the last 30 years,” said Tomer. “Today, our users can leverage Dremio to power live dashboards and reports directly on S3 and ADLS, instead of waiting weeks to have data moved into a data warehouse. We’re removing limitations, accelerating time to insight, and empowering data teams.”

Key new features of Dremio’s cloud data lake engine are designed to enable high-concurrency, low-latency SQL workloads, including BI dashboards, directly on the cloud data lake. These include:

  • Apache Arrow caching - Dremio can now cache data reflections (physically optimized representations of data) in the Apache Arrow format so the data can be loaded directly into memory with zero compute processing overhead. This eliminates the need to decode and decompress data at runtime, enabling sub-second query response times for BI dashboards.

  • Scale-out query planning - Dremio supports horizontal scaling for coordinator nodes, in addition to executor nodes, allowing companies to run high-concurrency workloads consisting of thousands of simultaneous users and queries.

  • Runtime filtering - By automatically leveraging runtime intelligence from dimension tables, Dremio drastically reduces the amount of data that must be read from a fact table. This results in a performance speedup of more than 100x for star schema workloads, which have traditionally run only on data warehouses.

  • Enhanced Power BI integration - Microsoft and Dremio have partnered to develop a deeper integration between Power BI and Dremio that enables users to launch Power BI Desktop directly from the Dremio interface with the click of a button. Power BI automatically connects to Dremio using a native connector, so users can easily transition from building a dataset in Dremio to analyzing their data in Power BI.

  • External queries - Dremio enables users to incorporate explicit SQL queries on their relational databases within Dremio virtual datasets. This makes it easy to join data between large datasets in a cloud data lake and smaller datasets in existing relational databases.
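The runtime-filtering idea above can be sketched in plain Python. This is an illustration of the general technique, not Dremio's implementation: evaluate the dimension-table filter first, collect the surviving join keys, and use per-partition min/max statistics (of the kind columnar formats like Parquet maintain) to skip fact-table partitions that cannot match. All table names and data below are hypothetical.

```python
# Illustrative sketch of runtime filtering on a star schema.
# Names and data are hypothetical; this is not Dremio's implementation.

# Dimension table: small, filtered first.
dim_customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
    {"customer_id": 3, "region": "EU"},
]

# Fact table stored as partitions, each carrying min/max key statistics.
fact_partitions = [
    {"min_id": 1, "max_id": 2, "rows": [{"customer_id": 1, "amount": 10},
                                        {"customer_id": 2, "amount": 20}]},
    {"min_id": 3, "max_id": 4, "rows": [{"customer_id": 3, "amount": 30},
                                        {"customer_id": 4, "amount": 40}]},
]

def filtered_join(region):
    # 1. Evaluate the dimension filter and collect surviving join keys.
    keys = {c["customer_id"] for c in dim_customers if c["region"] == region}
    # 2. Runtime filter: skip fact partitions whose key range cannot match,
    #    so their rows are never read at all.
    total = 0
    for part in fact_partitions:
        if not any(part["min_id"] <= k <= part["max_id"] for k in keys):
            continue  # partition pruned without reading its rows
        total += sum(r["amount"] for r in part["rows"]
                     if r["customer_id"] in keys)
    return total

print(filtered_join("EU"))  # 10 + 30 = 40
```

The speedup comes from step 2: on a real star schema the fact table is orders of magnitude larger than the dimension table, so pruning whole fact partitions before reading them avoids most of the I/O.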
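The external-query pattern can likewise be sketched in Python, using an in-memory SQLite database as a stand-in for an existing relational system. The idea is to push an explicit SQL statement down to the external database and join its (typically small) result with the larger data-lake dataset in the engine. The database schema and data here are illustrative assumptions, not a Dremio API.

```python
import sqlite3

# Stand-in for an existing relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
db.executemany("INSERT INTO products VALUES (?, ?)",
               [(1, "widget"), (2, "gadget")])

# Stand-in for a large dataset in the cloud data lake.
lake_sales = [
    {"product_id": 1, "qty": 5},
    {"product_id": 2, "qty": 3},
    {"product_id": 1, "qty": 2},
]

# "External query": an explicit SQL statement executed on the external
# database; only its small result set crosses over.
external = dict(db.execute("SELECT product_id, name FROM products"))

# Join the relational result with the lake data inside the engine.
report = [{"name": external[s["product_id"]], "qty": s["qty"]}
          for s in lake_sales]
print(report)
```

Because only the small relational result moves, the large lake-resident dataset never has to be copied into the database (or vice versa) to perform the join.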

Drop Me a Line, Let Me Know What You Think

© 2020 by Tom Smith | ctsmithiii@gmail.com | @ctsmithiii