Four Key Pillars of a Data-Centric Approach to AI

Nick Chase
March 26, 2025
Key Takeaway Summary

A data-centric approach to AI prioritizes improving data quality over tweaking models or code. As AI shifts toward unstructured data like text and images, traditional tools fall short. Data and analytics architects can address these challenges using four key pillars: data quality, data accessibility, data context, and data governance. These pillars enable the creation of high-quality, AI-ready datasets, supported by modern tools like automation, low-code platforms, and synthetic data generation for scalable, intelligent systems.

Building AI on Solid Ground: A Practical Approach to Data-Centric AI

Behind every useful AI application is one key ingredient: data. While AI models and algorithms get the spotlight, they can only perform as well as the data that feeds them. Without reliable, well-structured data, even advanced AI struggles to produce meaningful results.

Many organizations still focus on fine-tuning models while treating data as an afterthought (or worse yet, with baseless optimism). But applications and data evolve in a feedback loop that AI will only accelerate, creating ever-larger gaps and ever-larger opportunities for error.

A data-centric approach shifts the focus to making data better—improving its quality, accessibility, and context—so that AI systems can learn more effectively and perform reliably in real-world settings.

That said, focusing on data alone won’t solve every problem. Some industries face unavoidable data gaps and shortages, and improving data quality isn’t always enough to eliminate bias. Still, understanding the key principles of data-centric AI can help businesses make informed decisions about when and how to invest in their data infrastructure.

Before the rise of AI, organizations already faced persistent data quality and governance issues—duplicate records, inconsistent definitions, incomplete datasets, and siloed systems made it difficult to trust data for decision-making. AI has amplified these problems. Generative and machine learning models ingest vast, often unstructured datasets that expose hidden quality flaws and compound the impact of existing gaps in governance. But unlike traditional reporting, AI models don’t just reflect the data—they learn from it, which means errors or biases in data can lead to unpredictable or risky outcomes. For finance leaders and executives, this introduces new compliance, reputational, and operational risks. Without modern, scalable governance practices and real-time data quality monitoring, AI initiatives can underperform—or even backfire.

Gartner has released a report on "Four Key Pillars of a Data-Centric Approach to AI" that emphasizes this crucial point. These pillars are:

  1. Data Quality
  2. Data Accessibility
  3. Data Context
  4. Data Governance

Let's break them down and see how they work together to help you build robust and reliable AI systems. But first, let's take a closer look at how both humans and processes can introduce unintended errors within and across systems.

It all begins and ends and continues with your data.

What is "Quality Data" Anyway?

Fundamentally, quality data is data that is fit for its intended purpose, which requires it to not only accurately and consistently represent the reality it describes but also to be free from significant noise and systematic bias that could undermine its usefulness.

To achieve fitness for purpose, quality data must meet criteria across three key areas:

  • Accurate and complete representation: The data must faithfully capture the essential aspects of the real-world phenomena it represents at the appropriate time.
  • Consistent and interpretable structure: The data must be organized, formatted, and presented in a way that allows it to be reliably understood and used across different contexts.
  • Freedom from distortion (noise and bias): The data must be minimally affected by random errors (noise) and systematic deviations from the truth (bias).

Accurate and Complete Representation

For data to be fit for purpose, it must first accurately and completely represent the real-world phenomena it describes, reflecting reality at the appropriate time. This fundamental requirement means the data values themselves should be correct and free from error, such as a customer's address precisely matching their actual location. Furthermore, completeness dictates that all necessary information must be present, as missing data in critical fields can render the dataset unusable for certain tasks. 

Relevance is also intrinsically linked to timeliness, or currency: information must be sufficiently up-to-date for the specific context. Data current as of today might be essential for immediate operational decisions, for instance, but less critical for long-term historical trend analysis. Equally important are uniqueness, ensuring entities aren't duplicated in ways that could skew aggregates, and validity, meaning the data conforms to predefined rules and formats such as plausible ranges and correct data types. Together these ensure the representation is both non-redundant and adheres to its expected structural constraints.
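
To make these criteria testable, here is a minimal sketch in Python using pandas (an assumed tooling choice) that screens a hypothetical customer table for completeness, uniqueness, and validity. The column names and plausible ranges are illustrative only.

    import pandas as pd

    # Hypothetical customer table; columns and values are illustrative.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "email": ["a@example.com", None, "b@example.com", "c@example.com"],
        "age": [34, 29, 29, 140],
    })

    # Completeness: share of missing values per column.
    completeness = df.isna().mean()

    # Uniqueness: duplicated entities can skew aggregates.
    duplicate_ids = df["customer_id"].duplicated().sum()

    # Validity: values must fall within plausible ranges.
    invalid_ages = df[~df["age"].between(0, 120)]

    print(completeness)
    print(f"Duplicate customer IDs: {duplicate_ids}")
    print(f"Rows with implausible ages: {len(invalid_ages)}")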

Consistent and Interpretable Structure

Beyond simply mirroring reality, quality data must possess a consistent and interpretable structure, allowing it to be reliably understood and utilized across different contexts, systems, and by various users. This involves maintaining consistency, where data values, formats, and definitions are uniform within the same dataset and across related datasets, thereby preventing contradictions or ambiguities—for example, consistently using standard codes or units of measure. 

The structure also enhances interpretability through relevance, ensuring the included data is appropriate and applicable to the task at hand, avoiding unnecessary complexity. Moreover, accessibility is key; the data must be readily available, retrievable, and understandable to the people or systems needing it, often facilitated by clear documentation or metadata. Finally, structural integrity ensures the data maintains its internal coherence and defined relationships, such as valid links between related records in a database, contributing to its overall trustworthiness and usability.

Freedom from Distortion (Noise and Bias)

A critical, yet distinct, aspect of data quality involves ensuring the data is largely free from significant distortions, specifically by minimizing both random errors, known as noise, and systematic deviations from the truth, known as bias. 

Noise manifests as random, often meaningless variations or inaccuracies within the data—such as typos, minor measurement fluctuations, or data entry errors—that can obscure underlying patterns and reduce analytical precision, even if they don't skew results in a particular direction overall. 

Bias, conversely, represents systematic, non-random errors where the data is distorted in a specific way, leading it to inaccurately or unfairly represent reality. This is particularly pernicious as it can lead to fundamentally skewed analysis, unfair or discriminatory outcomes, and the reinforcement of existing inequities. Examples range from sampling bias, where data collection methods underrepresent certain groups, to measurement bias from faulty instruments, or algorithmic bias where models learn and perpetuate societal biases present in the training data. 

Actively identifying, understanding, and mitigating both noise and bias is essential for ensuring the data provides a truly reliable and trustworthy foundation for insights and decisions.
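
As a concrete example, sampling bias can often be surfaced by comparing group shares in a dataset against a known reference population. The following is a minimal sketch in Python; the groups, reference shares, and tolerance are hypothetical.

    import pandas as pd

    # Hypothetical training sample and assumed reference population shares.
    sample = pd.Series(["A", "A", "A", "B", "A", "B", "A", "A"], name="group")
    reference_shares = {"A": 0.5, "B": 0.5}  # e.g., known from census data

    sample_shares = sample.value_counts(normalize=True)

    # Flag groups under- or over-represented beyond a chosen tolerance.
    for group, expected in reference_shares.items():
        observed = sample_shares.get(group, 0.0)
        if abs(observed - expected) > 0.10:
            print(f"Group {group}: observed {observed:.0%} vs expected {expected:.0%}")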

The Core Principles of Data-Centric AI

Let's look at the idea of quality data in the context of Gartner's four pillars.

1. Data Quality: Accuracy, Completeness, and Bias Awareness

AI systems depend on accurate and well-structured data. "Garbage in, garbage out" is a common saying, but quality isn’t just about fixing errors—it’s about making sure data truly represents what an AI model needs to learn.

Key Considerations

  • All Data is Not Created Equal: The reality is that most data starts out unfit for purpose, but by cleaning, massaging, and aggregating it, you can make it useful. The open source Delta Lake project describes a medallion architecture in which the "bronze" layer is the raw data, "silver" is cleaned data, and "gold" is data that has been analyzed and aggregated, potentially with calculated fields added (a minimal sketch of this flow follows this list).
  • Beyond Clean Data: Removing duplicates and correcting errors isn’t enough. AI models also need truly representative data to avoid bias. If historical hiring data is biased against certain groups, an AI trained on that data will likely reproduce the same bias.
  • Context Matters: A dataset may look complete but still lack important context. In healthcare, for example, over-sampling one group and under-sampling another can make an AI model less effective for the population as a whole. Data may be easy to find, but that doesn't mean it's complete.
  • Ongoing Monitoring: Data quality isn’t a one-time fix. Businesses should track key metrics like missing values, inconsistencies, and bias over time.
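
To make the medallion pattern concrete, here is a minimal sketch of a bronze/silver/gold flow. It uses pandas for brevity; in an actual Delta Lake deployment each layer would typically be a Delta table written via Spark, so treat this as an illustration of the layering idea rather than the platform. The table contents are hypothetical.

    import pandas as pd

    # Bronze: raw data as ingested, duplicates and inconsistencies included.
    bronze = pd.DataFrame({
        "order_id": [1, 1, 2, 3],
        "amount": ["10.50", "10.50", None, "7.25"],
        "region": ["east", "east", "west", "WEST"],
    })

    # Silver: deduplicated, standardized, correctly typed.
    silver = (
        bronze.drop_duplicates()
              .dropna(subset=["amount"])
              .assign(amount=lambda d: d["amount"].astype(float),
                      region=lambda d: d["region"].str.lower())
    )

    # Gold: aggregated and analysis-ready, with calculated fields.
    gold = silver.groupby("region", as_index=False).agg(
        total_amount=("amount", "sum"),
        order_count=("order_id", "nunique"),
    )
    print(gold)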

Practical Steps

  • Use data quality dashboards to monitor issues in real time.
  • Test AI models for bias and fairness, not just accuracy.
  • Validate datasets with input from domain experts, not just automated tools.

2. Data Accessibility: Integration Without Overcomplication

AI models perform best when they have access to diverse, well-integrated datasets. However, many organizations still struggle with siloed systems and complex data pipelines.

Key Considerations

  • Data Sharing vs. Privacy Risks: Not all data can—or should—be easily shared. While integrating customer data across platforms can improve AI-powered recommendations, organizations must also consider privacy laws like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).
  • The Limits of Automation: AI-powered data platforms promise seamless integration, but they don’t eliminate the need for human oversight. Poorly integrated data can lead to misleading insights.
  • Not Every Business Needs a Data Lake: Large-scale data platforms sound appealing, but smaller organizations may benefit more from simpler, well-maintained datasets rather than complex, hard-to-manage architectures.

Practical Steps

  • Build a data catalog to improve discoverability without unnecessary duplication (a minimal catalog entry is sketched after this list).
  • Invest in data integration platforms that match the scale of your business.
  • Establish clear data-sharing policies to balance accessibility and privacy.
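
A data catalog can start small. The sketch below shows the kind of metadata a minimal catalog entry might carry; the fields and example values are hypothetical, and dedicated tools such as DataHub or Amundsen provide far richer models.

    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        """Minimal metadata describing one dataset for discoverability."""
        name: str
        owner: str
        description: str
        source_system: str
        contains_pii: bool = False          # informs data-sharing policy
        tags: list[str] = field(default_factory=list)

    entry = CatalogEntry(
        name="orders_gold",
        owner="analytics-team@example.com",
        description="Aggregated order totals per region, refreshed daily.",
        source_system="warehouse",
        tags=["sales", "gold-layer"],
    )
    print(entry)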

3. Data Context: Why Metadata and Human Expertise Matter

AI models don’t just need data—they need to understand what the data means. Without context, an AI can easily misinterpret information and produce misleading results.

Key Considerations

  • Metadata Isn’t Enough: Knowledge graphs, which store not only bits of information but also the relationships between them, and ontologies, which can serve as a schema for those graphs, help structure relationships between data points (see the sketch after this list), but they can’t replace domain knowledge. A financial AI trained to detect fraud might spot anomalies, but it takes an expert to determine whether they’re truly suspicious.
  • Explainability Matters: AI decisions should be understandable, especially in high-stakes areas like healthcare or finance. Data without plainly stated, unambiguous meaning can lead to unpredictable, hard-to-explain results.
  • Human-in-the-Loop Approaches Improve AI: Many AI failures stem from a lack of human oversight. Bringing domain experts into the data process can prevent costly mistakes.
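
To show what storing relationships between data points looks like in practice, here is a minimal knowledge-graph sketch using the networkx library (an assumed choice; production systems often use dedicated graph or triple stores). The entities and relations are hypothetical.

    import networkx as nx

    # A tiny knowledge graph: nodes are entities, edges carry named relations.
    kg = nx.DiGraph()
    kg.add_edge("Account:123", "Transaction:9", relation="initiated")
    kg.add_edge("Transaction:9", "Merchant:ACME", relation="paid_to")
    kg.add_edge("Account:123", "Region:EU", relation="located_in")

    # Context query: what do we know about Account:123?
    for _, target, data in kg.edges("Account:123", data=True):
        print(f"Account:123 --{data['relation']}--> {target}")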

Practical Steps

  • Invest in metadata management tools, but don’t rely on them alone.
  • Use explainability techniques such as SHapley Additive exPlanations (SHAP) or Local Interpretable Model-agnostic Explanations (LIME) to clarify how AI models use data (a SHAP sketch follows this list).
  • Involve domain experts when labeling and structuring data, especially for critical applications.
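
As a brief illustration of the explainability step, here is a sketch applying SHAP to a scikit-learn model. It assumes the shap and scikit-learn packages are installed and uses a synthetic dataset as a stand-in for real, well-governed data.

    import shap
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in for a real dataset.
    X, y = make_regression(n_samples=200, n_features=5, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X, y)

    # SHAP attributes each prediction to per-feature contributions.
    explainer = shap.Explainer(model)
    shap_values = explainer(X)

    # Summary plot shows which features drive the model overall.
    shap.plots.beeswarm(shap_values)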

4. Data Governance: Ethical and Practical Guardrails

AI is only as responsible as the data that trains it. Governance isn’t just about security—it’s about making sure AI systems align with ethical standards and legal requirements.

Key Considerations

  • Bias Can’t Be Fully Eliminated, but It Can Be Managed: Even with perfect data governance, some bias will remain. The goal is to reduce harm by testing for fairness and regularly updating models.
  • Explainability is a Compliance Issue: Regulations like GDPR require that AI-driven decisions be explainable. Governance strategies should include tools for tracking how data influences AI outcomes.
  • Security and Compliance Are Non-Negotiable: Data access should be restricted based on roles to prevent misuse, and encryption should be standard practice for sensitive information.

Practical Steps

  • Implement bias detection tools and adjust training data accordingly (a minimal fairness check is sketched after this list).
  • Track data lineage to ensure transparency in AI-driven decisions.
  • Set up role-based access controls to protect sensitive data.
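
One simple, widely used bias check is demographic parity: comparing the rate of positive outcomes across groups. Below is a minimal sketch with hypothetical column names; what counts as an acceptable gap is a policy decision, not a universal rule.

    import pandas as pd

    # Hypothetical model outputs joined with a protected attribute.
    results = pd.DataFrame({
        "group":    ["A", "A", "B", "B", "A", "B"],
        "approved": [1,   1,   0,   1,   1,   0],
    })

    # Demographic parity: positive-outcome rate per group.
    rates = results.groupby("group")["approved"].mean()
    parity_gap = rates.max() - rates.min()

    print(rates)
    print(f"Demographic parity gap: {parity_gap:.2f}")
    if parity_gap > 0.2:  # illustrative threshold
        print("Warning: outcomes differ substantially across groups.")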

Challenges and Trade-Offs

Shifting to a data-centric approach isn’t always straightforward. Organizations should consider:

  • The Cost of High-Quality Data: Improving data quality takes time and resources, and not all businesses can afford large-scale data governance initiatives.
  • Synthetic Data Has Limits: While synthetic data can help with privacy concerns and rare data cases, it may not fully capture the complexity of real-world data.
  • AI Can’t Fix Broken Data Strategies: If business processes generate poor-quality data, AI won’t magically correct them. Sometimes, fixing upstream processes is more effective than refining AI models.

What’s Next?

A data-centric approach helps AI models perform more reliably, but it’s not a silver bullet. Businesses should focus on:

  • Improving data quality and bias detection, not just fixing errors.
  • Finding the right balance between accessibility and privacy.
  • Ensuring AI models understand data context through metadata and expert input.
  • Building governance policies that go beyond compliance to encourage responsible AI use.

Moreover, human systems—including workflows, roles, and expertise—must adapt not only to collaborate effectively with AI but also to interpret and leverage the critical context provided by robust data lineage. AI is only as good as the data it learns from. By taking a practical, balanced approach to data management, businesses can build AI systems that are not just technically sound, but also trustworthy and effective in real-world settings.

AI/ML Practice Director / Senior Director of Product Management
Nick is a developer, educator, and technology specialist with deep experience in Cloud Native Computing as well as AI and Machine Learning. Prior to joining CloudGeometry, Nick built pioneering Internet, cloud, and metaverse applications, and has helped numerous clients adopt Machine Learning applications and workflows. In his previous role at Mirantis as Director of Technical Marketing, Nick focused on educating companies on the best way to use technologies to their advantage. Nick is the former CTO of an advertising agency's Internet arm and the co-founder of a metaverse startup.