
Navigating the Data Landscape for AI Success: Insights from Starburst

Explore how real-time data access, federated queries, and data products reshape AI development with insights from Starburst's latest survey.



As artificial intelligence (AI) continues to revolutionize industries, organizations face increasing pressure to manage and use their data effectively for AI initiatives. A recent survey by Starburst, the open data lakehouse company, sheds light on the critical role of data management in AI success. We spoke with Adrian Estala, VP and Field Chief Data Officer at Starburst, to dig into the survey findings and explore how organizations can optimize their data strategies for AI.


The Critical Role of Real-Time Data Access in AI


The Starburst survey highlighted real-time data access as a crucial factor for AI success. According to Estala, implementing real-time analytics poses several challenges:


  1. Ingesting large volumes of real-time data reliably and cost-effectively

  2. Efficiently integrating streaming data with other data assets

  3. Rapidly discovering and accessing distributed enterprise data


"AI teams need rapid discovery, ideation, and experimentation; there is no time to wait," Estala emphasizes. "The data should be easily accessible to enable data science teams to accelerate their unique AI use cases."


To address these challenges, Starburst offers a solution that lets data science teams use self-service tools for secure and quick data discovery. The platform can continuously ingest streaming data from various sources into a data lake in near real time while guaranteeing exactly-once semantics. Additionally, Starburst provides over 50 connectors to access data in distributed stores, both on-premises and in the cloud, supporting efficient data discovery and AI prototyping.
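To make the exactly-once idea concrete, here is a minimal Python sketch of one common way to get that effect: track the highest source offset already committed and skip anything redelivered. This is an illustration under stated assumptions, not Starburst's implementation; `LakeTable`, `ingest`, and the `(offset, payload)` record shape are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LakeTable:
    """Hypothetical stand-in for a lake table plus its commit metadata."""
    rows: list = field(default_factory=list)
    committed_offset: int = -1  # highest source offset already written

    def ingest(self, batch):
        """Append only records whose offset has not been committed yet.

        `batch` is a list of (offset, payload) tuples, an assumed record shape.
        Returns the number of records actually written.
        """
        new = [(off, payload) for off, payload in batch
               if off > self.committed_offset]
        if new:
            self.rows.extend(payload for _, payload in new)
            self.committed_offset = max(off for off, _ in new)
        return len(new)


table = LakeTable()
batch = [(0, "click"), (1, "view"), (2, "purchase")]

first = table.ingest(batch)   # initial delivery
retry = table.ingest(batch)   # simulated redelivery after a failure
later = table.ingest([(2, "purchase"), (3, "refund")])  # partial overlap
```

A real system would have to commit the data and the offset atomically. The sketch only shows why a redelivered batch does not produce duplicate rows.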


Streamlining Data Organization for Machine Learning


With 52% of survey respondents struggling to organize structured data for machine learning, Estala recommends several best practices for data engineers:


  1. Adopt an open, hybrid lakehouse architecture to support AI and business intelligence workloads.

  2. Reduce overhead on building and managing data pipelines for exploratory data analysis by leveraging data federation capabilities.

  3. Once relevant datasets are identified, choose between two approaches:

    1. For high-performance applications, migrate data to a data lake and build dedicated data products.

    2. For internal applications where data can't be moved for compliance reasons, build federated data products with a data lake as the center of gravity.


These practices can help data science teams move faster while reducing the pipeline and governance burden on data engineers during the exploratory stages of AI development.


Leveraging Federated Data Access for AI Innovation


The survey highlighted the importance of federated data access strategies in AI development. Estala explains that most chief data officers have a hybrid architecture, with data distributed across clouds, on-premises systems, and various source systems.


"It's the distributed data that is the secret sauce for companies to take AI from experimentation to impactful business solutions," Estala notes. Starburst's federation capabilities simplify the most challenging part of the AI development lifecycle: data discovery and model prototyping.


By connecting distributed data across lakes, warehouses, databases, and other sources without complex data migration, Starburst makes it significantly easier to test different datasets to optimize model performance. This approach enables teams to drive rapid discovery for ideation and prototyping exercises while encouraging the reuse of purpose-built data products from a marketplace.
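As a rough analogy for querying across sources without first migrating the data, the Python sketch below uses the standard-library `sqlite3` module to attach a second, separate database and join across both in a single SQL statement. It is not Starburst or its query engine; the `warehouse` alias, the table names, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# Two independent in-memory databases stand in for two separate data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

# Source 1: an operational store holding customer records.
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC")],
)

# Source 2: a warehouse holding order facts.
conn.execute("CREATE TABLE warehouse.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse.orders VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 50.0)],
)

# One query joins both sources in place; nothing is copied between them.
rows = conn.execute(
    """
    SELECT c.region, SUM(o.amount) AS total
    FROM main.customers AS c
    JOIN warehouse.orders AS o ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
```

The point is the shape of the workflow: a data scientist can try a different dataset by changing the query rather than building a new pipeline.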


Balancing Data Accessibility and Security


Data privacy and security emerged as significant concerns in the survey. Starburst's "Open Hybrid Lakehouse" approach addresses these concerns by allowing organizations to query data within and around the data lake without immediate migrations, thereby reducing exposure risks.


The platform incorporates robust security measures, including:


  • Fine-grained access controls (e.g., column, row, table) via RBAC and ABAC

  • Data encryption

  • Governance policies

  • Data observability capabilities


These features ensure that data remains protected and auditable while still readily accessible for analysis, striking a balance between security and accessibility.
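A minimal Python sketch of how role-based, fine-grained access control can work at the row and column level follows. It is illustrative only and does not reflect Starburst's policy engine; `Policy`, `POLICIES`, `read_table`, the role names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-role policy: visible columns plus a row predicate."""
    allowed_columns: frozenset
    row_filter: Callable[[dict], bool]


# Assumed roles for illustration: analysts see only EMEA rows and no customer
# names; auditors see everything.
POLICIES = {
    "analyst": Policy(frozenset({"region", "amount"}),
                      lambda row: row["region"] == "EMEA"),
    "auditor": Policy(frozenset({"customer", "region", "amount"}),
                      lambda row: True),
}


def read_table(rows, role):
    """Return only the rows and columns that the role's policy permits."""
    policy = POLICIES[role]
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]


ROWS = [
    {"customer": "Acme", "region": "EMEA", "amount": 120},
    {"customer": "Zen", "region": "APAC", "amount": 50},
]

analyst_view = read_table(ROWS, "analyst")
auditor_view = read_table(ROWS, "auditor")
```

Because the policy is applied at read time, the same underlying table can serve both roles without maintaining separate copies.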


Enhancing Data Literacy for AI Projects


Another key finding from the survey was the importance of data literacy. Estala emphasizes that data literacy programs should cover data management, AI governance, and data ethics. However, he stresses that this training should not be limited to IT developers and engineers but should also include business teams.


"We need to enable the business with self-service capabilities that will allow them to drive their own AI innovation and insight," Estala explains. Starburst's data products help business teams quickly find and understand the data they're putting into their AI engines.


Collaboration between business and technical teams can yield significant ROI when business teams share the context that data engineers need. This ensures that the metadata AI models rely on reflects how business teams frame their questions.


Real-World AI Integration with Starburst


Starburst's customers are leveraging the Open Hybrid Lakehouse platform to integrate AI capabilities into their data infrastructure. Some examples include:


  • Unifying data from disparate systems to improve the accuracy of predictive analytics and enhance decision-making.

  • Building Data Products as trusted data sources for GenAI tools like OpenAI, as demonstrated by Halliburton.


These use cases showcase how Starburst's solution enables seamless, real-time access to data across multiple sources, allowing for more efficient training and deployment of AI models.


Supporting Hybrid and Multi-Cloud Environments for AI Workloads


With the trend towards cloud-based platforms for scalability, Starburst's solution is designed to support hybrid and multi-cloud environments. The platform enables seamless data access and price-performant query execution across cloud and on-premises systems.


This flexibility allows organizations to leverage data in and around the data lake without replication, providing scalability for AI workloads during prototyping and experimentation. Once a use case is ready for production, teams are encouraged to move its data to a data lake for optimal performance and scale, depending on the AI-powered application's performance needs and data governance requirements.


Implementing Agile Methodologies for Data and AI Projects


The survey highlighted the adoption of agile methodologies for data project management. Estala emphasizes the importance of data products in implementing agile approaches for data and AI projects.


"Data science and AI teams need the agility to quickly discover and reuse trusted and secure data products to accelerate the delivery of their data projects," he explains. Data products enable teams to:


  • Ideate through different datasets, failing fast until they find what they need

  • Work in a self-service manner and run at market pace

  • Conduct forensics and audits more easily, thanks to transparent design logic and fine-grained access control


This approach allows data project teams to work more efficiently and adapt quickly to changing requirements and insights.


Future Trends in Data Management for AI


Looking ahead, Estala notes that Starburst's focus remains on refining and enhancing its core strengths:


  • Helping teams quickly discover, access, analyze, and share relevant data to power AI

  • Leveraging Data Products as a mechanism for experimental and production models

  • Implementing sound governance and security practices for data

  • Ensuring data accessibility while offering the best combination of price and performance within its query engine

  • Maintaining optionality within the data stack through open source and integration with tools that data teams love


These focus areas reflect the ongoing challenges and opportunities in data management for AI, emphasizing the need for flexible, secure, and efficient data solutions.


Conclusion


As organizations continue to navigate the complex landscape of AI development, effective data management emerges as a critical factor for success. Starburst's survey and insights from Adrian Estala highlight the importance of real-time data access, federated queries, and data products in driving AI innovation.


By addressing challenges in data organization, security, and accessibility, companies can create a solid foundation for their AI initiatives. As the field evolves, adopting agile methodologies and staying attuned to emerging trends will be crucial for organizations looking to harness the full potential of their data for AI-driven success.

