5 Challenges: How to make your Lakehouse the reservoir that powers GenAI and BI success
Data Integration has become a key focus for organizations aiming to unlock value from their rapidly growing data. Cloud-scale data stores – databases, file stores, and the range of big data types – have led many to adopt a data lakehouse platform, Snowflake and Databricks being the most prominent among the many options.
While such an approach offers scalability and flexibility, the journey from raw data to actionable insights faces a range of technical challenges, old and new. Careful planning and awareness of the nuances of such challenges are the key to achieving the cumulative effect of AI + BI. Therefore, a more systematic approach to data engineering is needed to ensure the lakehouse is properly configured.
Imagine compressing the work of a new analytics project spanning multiple data sources from several months into a single notional week. If that week went like most such projects, data preparation would consume so much of it that it would be late Friday afternoon before you were ready to start asking questions.
Here, we explain what we have learned to look for (and watch out for) as you evolve through your data journey. We'll also introduce a new approach that we have seen unlock even more value for your data stakeholders—faster, more flexibly, and at a lower cost than simply channeling every workload through your data warehouse, data lake, or data lakehouse.
Step 1: The Initial Setup — Databricks Lakehouse Integration
For simplicity, we'll start with Databricks as the platform of choice (its pluses and minuses are outside the scope of this post).
Organizations often begin their data journey by consolidating various data sources—ERP, HRIS, CRM, financials, and other business process workloads—into a lakehouse. This provides a unified platform to handle both structured and unstructured data.
To be sure, a Databricks lakehouse makes it less onerous to store large volumes of data. Nevertheless, like the data warehouse strategies of the past 4+ decades, it introduces challenges around data governance, particularly at the field level. Data from different sources often have mismatches in naming conventions, units, and formats, which complicates harmonization.
- Typical Concerns:
- Field-Level Data Governance: A field in one system may represent the same concept but have different formats or granularity compared to another system. For example, date fields may vary in format (YYYY-MM-DD vs. MM/DD/YYYY), or customer data may be inconsistently structured across ERP and CRM systems.
- Federated Data Access: With federated data access, ensuring that each department adheres to the same governance policies becomes a challenge. This results in inconsistent data across various teams, as policies for data access, privacy, and security often vary from one business unit to another. Data access privileges may also conflict with centralized governance, leading to fragmented data control.
- Challenge: These inconsistencies hinder the creation of a unified dataset that can be trusted for accurate BI or AI outputs. A minimal sketch of reconciling one such mismatch follows below.
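As an illustration only, here is a minimal PySpark sketch of harmonizing a date field that arrives in two different formats from two source systems before the feeds are merged. The table and column names (bronze.erp_orders, bronze.crm_orders, order_dt) are hypothetical.

```python
# Minimal sketch: reconciling one field-level mismatch (date formats) between two
# source feeds before they are merged. All table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

erp = spark.table("bronze.erp_orders")   # dates arrive as "YYYY-MM-DD"
crm = spark.table("bronze.crm_orders")   # dates arrive as "MM/DD/YYYY"

# Normalize both feeds to a proper DATE column with one agreed name.
erp_std = erp.withColumn("order_date", F.to_date("order_dt", "yyyy-MM-dd"))
crm_std = crm.withColumn("order_date", F.to_date("order_dt", "MM/dd/yyyy"))

# Union on a shared, governed schema so downstream consumers see one convention.
orders = erp_std.select("order_id", "customer_id", "order_date").unionByName(
    crm_std.select("order_id", "customer_id", "order_date")
)
```

The point is not the two lines of date parsing; it is that the reconciliation rule is written down once, in code, instead of being re-derived by every consumer.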
Step 2: Data Engineering and Preparation for AI Readiness
Once the data is ingested into the lakehouse, it must be transformed and standardized. This involves a complex Extract, Transform, Load (ETL) pipeline to ensure the data is ready for AI and BI consumption.
- Typical Concerns: ETL pipelines, particularly in highly dynamic environments, can be fragile and difficult to manage. Spark SQL and Unity Catalog are frequently used for querying and managing metadata, but they introduce their own set of challenges.
These can include:
- Compatibility issues with different data formats
- Limited support for certain complex transformations
- Challenges in maintaining data governance across independent teams
- Potential performance bottlenecks when dealing with large-scale or highly nested datasets

And lest we forget, ETL pipelines are software code: don't be surprised by headaches in version control, schema drift, and the like.
- Field Mapping: Mapping fields across multiple data sources requires precise definitions and transformations, but manual mapping is prone to errors and inconsistency (see the sketch after this list).
- Data Lineage: Ensuring that data transformations are traceable from raw ingestion to the final Gold tables is a constant struggle. Lack of proper lineage tracking leads to issues in data provenance and accountability, especially when discrepancies occur in AI models or BI reports.
- Discoverability: Even when data is cleaned and transformed, discovering the right fields and tables across the lakehouse is cumbersome due to the decentralized nature of the metadata.
- Challenge: These technical challenges make ETL pipelines prone to failure, leading to significant delays and issues in ensuring data accuracy, making it difficult to trust the resulting dataset for further analysis.
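To make the field-mapping concern concrete, one common tactic is to keep the source-to-target mapping as declarative metadata and generate the transformation from it, rather than hand-writing a SELECT per source. This is a minimal sketch under that assumption; the table names, column names, and FIELD_MAP are hypothetical.

```python
# Minimal sketch: a declarative source-to-target field mapping that drives the
# Silver-layer transformation, instead of hand-written per-source SELECTs.
# Table names, column names, and FIELD_MAP are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# source column -> (target column, target type)
FIELD_MAP = {
    "cust_no": ("customer_id", "string"),
    "ord_amt": ("order_amount", "decimal(18,2)"),
    "ord_dt":  ("order_date", "date"),
}

raw = spark.table("bronze.erp_orders")

# Generate the projection from the mapping so every pipeline applies the same rules.
silver = raw.select(
    *[F.col(src).cast(dtype).alias(dst) for src, (dst, dtype) in FIELD_MAP.items()]
)

# Writing a managed Delta table keeps the step visible to catalog-level lineage tools.
silver.write.mode("overwrite").saveAsTable("silver.orders")
```

A mapping table like this can also be version-controlled and reviewed, which takes some of the sting out of schema drift.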
Step 3: Preparing Data for Use in BI Applications and Dashboards
The next step in the process is to derive aggregate data from the transformed datasets to create "Gold" tables or views. These tables should serve as a clean, single source of truth for BI tools or AI models, but this often falls short in practice.
- Typical Concerns: When preparing data for BI and AI, creating accurate aggregates is a key step. However, typical pitfalls occur during this phase, especially when trying to maintain consistency with the “single source of truth” principle.
- Aggregates for Data Warehouses and Marts: Aggregates such as sales totals, user activity counts, or customer lifetime value are often miscalculated due to incorrect joins, filtering issues, or inconsistent field definitions across datasets. This leads to discrepancies between reports generated from different departments.
- Single Source of Truth: Data teams often struggle to maintain a reliable single source of truth, as conflicting business logic and varying aggregation methods lead to different results for similar queries across teams.
- Challenge: These inconsistencies create mistrust in the data and lead to fragmented decision-making processes, as different teams may rely on conflicting reports. A sketch of one mitigation, defining each aggregate once as a shared Gold table, follows below.
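One common mitigation is to define each aggregate exactly once, in code, and publish it as a Gold table that every team queries. The sketch below assumes a hypothetical silver.orders table with customer_id, order_id, order_amount, order_date, and status columns; it is an illustration, not a prescription.

```python
# Minimal sketch: one shared definition of a Gold aggregate, so every team reads the
# same numbers. Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.table("silver.orders")

# The business logic lives in exactly one place: completed orders only.
completed = orders.filter(F.col("status") == "completed")

customer_ltv = completed.groupBy("customer_id").agg(
    F.sum("order_amount").alias("lifetime_value"),
    F.countDistinct("order_id").alias("order_count"),
    F.max("order_date").alias("last_order_date"),
)

# Publish as a Gold table; BI dashboards and AI features both read from here.
customer_ltv.write.mode("overwrite").saveAsTable("gold.customer_lifetime_value")
```

When two departments disagree about "customer lifetime value," the argument can then be settled by changing one definition rather than reconciling several.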
Step 4: Medallion Architecture: Pressing Bronze into Silver into Gold
The Medallion architecture in Databricks posits three data tiers – Bronze (raw data), Silver (cleaned data), and Gold (aggregated data) – to structure the transformation pipeline. While it’s a flexible approach, keeping these tiers in balance becomes harder as the system scales and conflicts arise.
- Typical Concerns: Inputs to the Bronze layer come from various business processes with different development cycles, leading to synchronization issues. As raw data feeds in asynchronously, engineering teams must ensure timely transformation and delivery, which creates challenges when different business units work at varying paces.
- Business Process Conflicts: When one team updates their data pipeline (e.g., changing schema or logic), it may affect downstream processes in unexpected ways. These development cycle mismatches cause delays and introduce data inconsistencies.
- Silver Layer Disconnect: The Silver Layer is often where governance breaks down. Data that is supposed to be cleansed and standardized here can still be missing key metadata or fail to adhere to governance standards, leading to issues when this data reaches the Gold layer.
- Lack of Feedback from Dashboard Consumers: Another common issue is the lack of feedback from BI and dashboard consumers. Without proper channels to relay insights back to data engineers, the Gold layer often doesn’t reflect the evolving needs of the business, leading to stagnation and irrelevant datasets.
- Challenge: Misalignment across teams, poor governance at the Silver level, and lack of collaboration with BI consumers slow down the pipeline, leading to outdated and untrustworthy data in the Gold layer. A skeletal Bronze-to-Gold pipeline is sketched below.
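One way teams keep the tiers in step is to express the whole flow as a single declarative pipeline, for example with Delta Live Tables, so Silver-level quality rules are enforced in code rather than by convention. The following is a minimal sketch under that assumption; the source table raw.erp_orders and all column names are hypothetical.

```python
# Skeletal medallion pipeline expressed with Delta Live Tables: one possible way to
# keep Bronze/Silver/Gold in step and enforce Silver quality rules in code.
# The source table raw.erp_orders and all column names are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw ERP orders landed as-is.")
def bronze_orders():
    # `spark` is provided by the DLT runtime in a pipeline notebook.
    return spark.read.table("raw.erp_orders")

@dlt.table(comment="Silver: typed, deduplicated orders with basic quality checks.")
@dlt.expect_or_drop("has_order_date", "order_date IS NOT NULL")
def silver_orders():
    return (
        dlt.read("bronze_orders")
        .withColumn("order_date", F.to_date("order_dt"))
        .dropDuplicates(["order_id"])
    )

@dlt.table(comment="Gold: daily sales totals for BI consumption.")
def gold_daily_sales():
    return (
        dlt.read("silver_orders")
        .groupBy("order_date")
        .agg(F.sum("order_amount").alias("total_sales"))
    )
```

Because the Silver expectation is declared next to the transformation, a schema change upstream surfaces as a failed expectation rather than as a quietly wrong Gold table.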
Step 5: Compliance and Data Governance Challenges
At every stage of the data lifecycle, compliance is critical. However, many organizations struggle to maintain proper data governance, which leads to serious compliance risks.
- Typical Concerns: Compliance risks are particularly high when handling sensitive data, such as Personally Identifiable Information (PII) or Protected Health Information (PHI). With evolving regulations such as GDPR, CCPA, and data sovereignty laws, managing this data effectively becomes critical.
- Data Sovereignty: Organizations must ensure that data remains within its region of origin. However, as teams access data across regions, maintaining sovereignty rules becomes challenging.
- Data Breaches and RBAC: Role-Based Access Control (RBAC) is essential to prevent data breaches and restrict data access to only authorized personnel. However, inconsistent implementation of RBAC policies often results in over-privileged data access, increasing the risk of breaches.
- Privileged Data Access: Improper management of data access rights and privileges can lead to sensitive data being exposed to unauthorized users, further exacerbating compliance risks.
- Challenge: These governance and compliance issues, if left unchecked, can result in legal penalties, reputational damage, and significant financial losses. The sketch below shows one way to encode such controls directly in the catalog.
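For instance, Unity Catalog lets teams attach column masks and narrow grants to Gold tables so PII exposure is limited by default. The sketch below is illustrative only; the group names (pii_readers, analysts) and the gold.customers table are hypothetical.

```python
# Minimal sketch: least-privilege grants plus a column mask in Unity Catalog, issued
# from a notebook via spark.sql(). Group names (pii_readers, analysts) and the
# gold.customers table are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A masking function: only members of the pii_readers group see raw email addresses.
spark.sql("""
    CREATE OR REPLACE FUNCTION gold.mask_email(email STRING)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***REDACTED***'
    END
""")

# Attach the mask to the sensitive column.
spark.sql("ALTER TABLE gold.customers ALTER COLUMN email SET MASK gold.mask_email")

# Grant read access narrowly, table by table, rather than catalog-wide privileges.
spark.sql("GRANT SELECT ON TABLE gold.customers TO `analysts`")
```

Keeping these statements in version control alongside the pipeline code makes access decisions auditable rather than tribal knowledge.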
As the data pipeline grows in complexity, the inability to handle more advanced use cases such as "what-if" scenarios for predictive analytics becomes more and more likely. What's more, it is a feature (not a bug) of analytics architectures that demand for new analyses grows as the platform delivers value.
Without a robust system in place, the data pipeline struggles to support these more advanced BI and predictive analytics needs. For example, generating projections or simulating scenarios like future sales trends becomes difficult when the pipeline cannot dynamically handle complex queries or model inputs. The inability to process "what-if" scenarios leaves business stakeholders unable to adapt their analytics in a timely fashion, limiting the value of the data and of the lakehouse built to contain it.
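To illustrate what a lightweight "what-if" capability looks like once the Gold layer can be trusted, here is a minimal sketch that compounds a scenario growth rate over a baseline drawn from a hypothetical gold.daily_sales table. It illustrates the concept only; it is not any particular product's feature.

```python
# Minimal sketch: a parameterized "what-if" projection over a Gold table, assuming a
# hypothetical gold.daily_sales table with a total_sales column. The growth rate is
# the scenario input supplied by the analyst.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def project_sales(monthly_growth_rate: float, months_ahead: int = 6) -> list:
    """Compound a flat monthly growth rate over the observed sales baseline."""
    baseline = (
        spark.table("gold.daily_sales")
        .agg(F.sum("total_sales").alias("baseline"))
        .collect()[0]["baseline"]
    )
    return [
        (month, float(baseline) * (1 + monthly_growth_rate) ** month)
        for month in range(1, months_ahead + 1)
    ]

# Compare scenarios side by side: conservative vs. optimistic growth.
conservative = project_sales(0.01)
optimistic = project_sales(0.05)
```

The projection itself is trivial; the hard part, and the point of everything above, is having a baseline number that everyone agrees on.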
Enabling a virtuous data cycle with Claritype and AI
As data pipelines become increasingly complex, businesses need a simplified solution that can handle the technical, governance, and compliance challenges outlined above. Claritype brings a new and interesting approach to the problem, applying advanced classification algorithms to heterogeneous data sources to make them more flexible and more approachable for a wide range of analytics users. Key features:
- AI-Powered Active Data Integration: Claritype automates the entire data pipeline, integrating seamlessly with Databricks lakehouses to handle data ingestion, transformation, and export without manual intervention.
- No-Code Data Exploration: Analysts can explore data without needing to write complex queries, making it easier for business stakeholders to derive insights directly.
- Compliance and Metadata Management: Claritype provides built-in governance tools, ensuring that PII, PHI, and data sovereignty concerns are addressed automatically, while maintaining proper metadata management for discoverability.
- What-If Scenarios: By automating data preparation and aggregation, Claritype makes it possible to generate real-time projections and run "what-if" scenarios for advanced predictive analytics.
Looking ahead
Filling a lakehouse with data does not guarantee its success. The complexity of managing data pipelines in your lakehouse requires a well-engineered approach to governance, compliance, and automation.
Given the data engineering challenges at every step of the pipeline—from ETL/ELT/streamed raw data ingestion to BI readiness—adding a solution like Claritype goes a long way toward simplifying and automating the process. By reducing manual work, empowering a wide range of end users, and ensuring robust data governance within the scope of existing business processes and the workloads that support them, Claritype empowers organizations to leverage their data faster and more productively than a lakehouse platform alone.
And as more users find more and more ways to derive analytic value from your data lakehouse, the appetite for data and data integration multiplies. Lowering the barriers to systems integration for the transactional and operational sources of your data can pay even more dividends as you continue to upgrade your data infrastructure and workloads to meet the emerging needs of your markets, customers, and end users.