What does 'data readiness for a data agent' mean?

Data readiness means your data can be reliably queried, analyzed, and acted on by an AI-driven tool such as Fabric Copilot or any agentic pipeline. It requires the data to be discoverable, fresh, clean, performant, and centrally accessible — not just technically available.

What are the three types of data sources a data agent needs to handle?

Structured sources (e.g., SQL databases for sales records), semi-structured sources (e.g., JSON logs from event hubs), and unstructured sources (e.g., documents in blob storage). All three need to be inventoried and mapped to the questions the agent will be asked.

How do you check if data is fresh enough for an agent?

Query the maximum update timestamp in each source table — e.g., SELECT MAX(update_date) FROM sales_table — and compare it to the latency your use case requires. Real-time use cases need streaming pipelines; daily reporting use cases may tolerate scheduled ETL.

What query can I use to find data quality issues before deploying an agent?

For SQL sources, use SELECT COUNT(*) - COUNT(column_name) AS nulls, COUNT(*) - COUNT(DISTINCT key_column) AS duplicates FROM table. In Python with pandas, use df.isnull().sum() for nulls and df.duplicated().sum() for duplicates. Generate a per-column profile before deploying any agent on top.

Data Readiness Checklist for a Data Agent

Before your agent can trust your data

A data agent (whether it's Fabric Copilot, a custom AI pipeline, or an automated reporting layer) is only as reliable as the data it runs on. Most deployment failures aren't model failures. They're data failures: missing fields, stale records, ambiguous column names, or tables that take 45 seconds to scan.

This checklist ensures your data is primed for an agent to query, analyze, and act on it reliably. Work through one step at a time. Document your fixes as you go. Use SQL, data profilers, or ETL pipelines for validation.

Step 1: Inventory Data Sources and Identify Gaps

List every relevant source your agent may need to answer questions. Group them by type:

Structured (SQL databases): sales records, ERP transaction tables, financial ledgers, vendor masters

Semi-structured (JSON / event logs): Event Hub streams, API payloads, audit logs, webhook outputs

Unstructured (documents / blobs): PDFs in blob storage, email attachments, SharePoint files, contracts

For each key question the agent might handle ("What's our Q1 revenue?", "Which vendors have open POs above $50K?"), confirm that the data exists, is accessible, and is mapped to a table or file location.

Action

Build a simple question-to-source mapping table. Example: sales data is in the warehouse but lacks real-time updates → flag for streaming integration. Document gaps separately as sourcing backlog items.
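
One way to keep the question-to-source mapping is as a small data structure that can be filtered into a backlog. The sketch below is illustrative only: the class name, field names, source locations such as warehouse.sales, and the third sample question are assumptions, not part of any Fabric API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuestionMapping:
    question: str
    source: Optional[str]  # table or file location; None if nothing exists yet
    accessible: bool       # can the agent's identity actually read the source?
    gap: str = ""          # free-text note, e.g. "lacks real-time updates"


def sourcing_backlog(mappings: List[QuestionMapping]) -> List[QuestionMapping]:
    """Return the mappings that need work before the agent can answer them."""
    return [m for m in mappings if m.source is None or not m.accessible or m.gap]


# Hypothetical entries; replace with your own questions and locations.
mappings = [
    QuestionMapping("What's our Q1 revenue?", "warehouse.sales", True,
                    gap="lacks real-time updates; flag for streaming integration"),
    QuestionMapping("Which vendors have open POs above $50K?",
                    "erp.purchase_orders", True),
    QuestionMapping("What does the vendor contract say about renewal?", None, False),
]

backlog = sourcing_backlog(mappings)
for m in backlog:
    print(f"{m.question} -> {m.source or 'NO SOURCE'} "
          f"({m.gap or 'unmapped or inaccessible'})")
```

Anything that lands in the backlog is a sourcing item, not an agent problem.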

Step 2: Verify Data Freshness

Determine the latency your use case requires. Real-time stock levels need streaming. Daily financial reports may tolerate a scheduled ETL batch. A mismatch between what the agent is expected to know and how often the data actually refreshes produces wrong answers, confidently delivered.

1. Query last-update timestamps

For each key table, run a freshness check and compare to required latency.

SELECT MAX(update_date) FROM sales_table;

2. Fix staleness with the right pipeline type

Streaming requirement → Apache Kafka or Fabric Eventstream. Daily batch → scheduled ETL or Fabric Data Factory pipeline.

Action

Example: inventory data lags 24 hours but the agent answers "current stock level" questions. Implement an hourly trigger refresh or switch to a streaming source. Document confirmed latency SLAs per table.
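
The freshness check can be automated by comparing each table's newest timestamp to its latency SLA. The sketch below uses an in-memory SQLite database as a stand-in for the real source; the inventory table, the SLA values, and the function name stale_tables are assumptions chosen to mirror the example above.

```python
import sqlite3
from datetime import datetime, timedelta


def stale_tables(conn, sla_by_table, now):
    """Return {table: lag} for tables whose newest row is older than its SLA."""
    stale = {}
    for table, (ts_column, max_lag) in sla_by_table.items():
        # Table and column names come from our own config, never from user input.
        (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
        if latest is None:
            stale[table] = None  # empty table: freshness cannot be established
            continue
        lag = now - datetime.fromisoformat(latest)
        if lag > max_lag:
            stale[table] = lag
    return stale


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_table (id INTEGER, update_date TEXT)")
conn.execute("CREATE TABLE inventory (sku TEXT, update_date TEXT)")
conn.execute("INSERT INTO sales_table VALUES (1, '2024-06-01 08:00:00')")
conn.execute("INSERT INTO inventory VALUES ('A1', '2024-05-31 09:00:00')")

now = datetime(2024, 6, 1, 9, 0, 0)  # fixed clock so the example is reproducible
slas = {
    "sales_table": ("update_date", timedelta(hours=24)),  # daily batch is fine
    "inventory": ("update_date", timedelta(hours=1)),     # "current stock" questions
}
result = stale_tables(conn, slas, now)
print(result)
```

Here only inventory is flagged: its newest row is 24 hours old against a 1-hour SLA, while sales_table is within its daily window.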

Step 3: Assess Data Quality

An agent will not detect bad data; it will answer with it. Profile every table the agent will touch before it goes live.

1. Profile for nulls and duplicates

In Python (pandas): df.isnull().sum() for nulls, df.duplicated().sum() for dupes. In SQL:

SELECT
  COUNT(*) - COUNT(email) AS null_emails,
  COUNT(*) - COUNT(DISTINCT vendor_id) AS dupe_vendors
FROM vendors;

2. Validate schema and column naming

Rename ambiguous fields so the agent can reason about them. CustID → CustomerID. Amt → InvoiceAmount_USD. Check for type mismatches: dates stored as strings, amounts stored as VARCHAR.

3. Clean and document

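Deduplicate with the statement below, which keeps the first row for each (vendor_name, tax_id) pair. It relies on a rowid pseudo-column, as in SQLite or Oracle; other engines need an equivalent based on ROW_NUMBER().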

DELETE FROM vendors
WHERE rowid NOT IN (
  SELECT MIN(rowid) FROM vendors
  GROUP BY vendor_name, tax_id
);

Action

Generate a per-table quality report (null rates, duplicate rates, type issues). If 10% of email fields are null in a customer table, script a fill with defaults or exclude the field from agent context entirely.
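
A per-table report of this kind can be produced with a few counting queries. The following is a minimal sketch against an in-memory SQLite copy of a vendors table; the function name quality_report, the report layout, and the sample rows are assumptions for illustration.

```python
import sqlite3


def quality_report(conn, table, key_columns, nullable_columns):
    """Null rate per column plus the share of rows that repeat an existing key."""
    (total,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    report = {"rows": total, "null_rate": {}, "duplicate_rate": 0.0}
    if total == 0:
        return report
    for col in nullable_columns:
        # COUNT(col) skips NULLs, so the difference is the number of NULLs.
        (non_null,) = conn.execute(f"SELECT COUNT({col}) FROM {table}").fetchone()
        report["null_rate"][col] = (total - non_null) / total
    keys = ", ".join(key_columns)
    (distinct,) = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {keys} FROM {table})"
    ).fetchone()
    report["duplicate_rate"] = (total - distinct) / total
    return report


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendors (vendor_id INTEGER, vendor_name TEXT, tax_id TEXT, email TEXT)"
)
conn.executemany("INSERT INTO vendors VALUES (?, ?, ?, ?)", [
    (1, "Acme", "T-100", "ap@acme.example"),
    (2, "Acme", "T-100", None),  # duplicate business key and a missing email
    (3, "Globex", "T-200", "billing@globex.example"),
    (4, "Initech", "T-300", "ar@initech.example"),
])

report = quality_report(conn, "vendors", ["vendor_name", "tax_id"], ["email"])
print(report)
```

Duplicates are measured on the business key (vendor_name, tax_id) rather than on vendor_id, since a surrogate ID can be unique even when the same vendor was entered twice.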

Step 4: Evaluate Volume and Performance

A query that takes 30 seconds in a test environment will time out or frustrate users in a live agent. At scale (billions of rows, unpartitioned tables) even simple aggregations become slow.

1. Measure row counts and query times

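Use EXPLAIN to identify full table scans before they become production problems. EXPLAIN shows the plan only; where the engine supports it (for example PostgreSQL), EXPLAIN ANALYZE also runs the query and reports actual execution time.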

EXPLAIN SELECT * FROM large_table
WHERE date > '2024-01-01';

2. Partition by date or range for large tables

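For a 5B-row log table with 30s query times, add range partitioning. Partitioning syntax is engine-specific: the statement below uses MySQL syntax, assumes log_date is a DATE column, and uses illustrative boundaries. In PostgreSQL, partitioning is declared when the table is created, so an existing table has to be rebuilt as a partitioned one.

ALTER TABLE logs PARTITION BY RANGE (YEAR(log_date)) (
  PARTITION p2023 VALUES LESS THAN (2024),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);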

3. Index key filter columns

Index columns the agent will filter on: transaction date, vendor ID, entity code, cost center. In Fabric / Delta Lake, use Z-ORDER clustering on high-cardinality filter columns.

Action

Set a query performance target before deployment: e.g., all agent queries must return in under 3 seconds. Profile the five most common agent queries against that target and optimize until met.
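
Checking queries against the target can be scripted so it is repeatable after every optimization. This is a rough sketch, assuming the common agent queries are known as SQL strings; the function name slow_queries, the query names, and the tiny sample table are made up, and a single timed run is only a first approximation of production latency.

```python
import sqlite3
import time


def slow_queries(conn, queries, budget_seconds):
    """Run each query once and return {name: seconds} for those over budget."""
    over = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the full cost is measured
        elapsed = time.perf_counter() - start
        if elapsed > budget_seconds:
            over[name] = elapsed
    return over


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_table (id INTEGER, date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO large_table VALUES (?, ?, ?)",
    [(i, "2024-02-01", float(i)) for i in range(1000)],
)

# Hypothetical stand-ins for the most common agent queries.
queries = {
    "recent_rows": "SELECT * FROM large_table WHERE date > '2024-01-01'",
    "total_amount": "SELECT SUM(amount) FROM large_table",
}

over = slow_queries(conn, queries, budget_seconds=3.0)
print(over or "all queries within budget")
```

Run it against a production-sized copy of the data; timings from a small test table say little about behavior at billions of rows.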

Step 5: Migrate or Integrate Data if Required

If your data is scattered across five systems with no unified access layer, the agent will either fail silently or require complex per-source connectors that become maintenance liabilities. Centralize before you deploy.

1. Move data to a unified lakehouse

OneLake in Microsoft Fabric is the recommended target for Finance + AI workloads. Map migration steps per source, then verify post-move row counts match.

-- Verify post-migration row count
SELECT COUNT(*) FROM new_table;
-- Must match: SELECT COUNT(*) FROM source_table;

2. Connect real-time external sources

For sources that can't be migrated (e.g., Salesforce CRM, live ERP), use Fabric Data Factory connectors or Power Query Online to create live feeds with scheduled or triggered refresh.

3. Test the agent on a subset before full deployment

Ask the agent a curated set of 10 questions against the migrated data. Validate every answer against a known-correct source. Only proceed to full deployment after all 10 match.

Action

Example: data is in S3. Use a Fabric copy job to OneLake, then verify counts. For Salesforce, use the Fabric Salesforce connector and schedule hourly syncs for pipeline-critical entities like Accounts and Opportunities.
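
The post-migration count check is easy to script across many tables. The sketch below uses two in-memory SQLite databases to stand in for the source system and the lakehouse; the function name count_mismatches and the sample rows are assumptions, and real connections would come from whatever drivers your sources use.

```python
import sqlite3


def count_mismatches(source_conn, target_conn, table_pairs):
    """Return {(source_table, target_table): (source_rows, target_rows)} where counts differ."""
    mismatches = {}
    for source_table, target_table in table_pairs:
        (src,) = source_conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()
        (dst,) = target_conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()
        if src != dst:
            mismatches[(source_table, target_table)] = (src, dst)
    return mismatches


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE source_table (id INTEGER)")
target.execute("CREATE TABLE new_table (id INTEGER)")
source.executemany("INSERT INTO source_table VALUES (?)", [(1,), (2,), (3,)])
target.executemany("INSERT INTO new_table VALUES (?)", [(1,), (2,)])  # one row lost

mismatches = count_mismatches(source, target, [("source_table", "new_table")])
print(mismatches)
```

Matching counts are a necessary check, not a sufficient one: they do not prove the row contents survived the move, so spot-check a few aggregates as well.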

After each step: test, iterate, document

Don't complete all five steps and then test. After each step, ask the agent a sample set of questions scoped to that layer of readiness. A freshness issue caught after Step 2 is a 2-hour fix. The same issue discovered post-deployment is a trust problem with your stakeholders.

Step 1 done → Ask: can the agent find every data source it needs? Does it error on any question?
Step 2 done → Ask: are the answers based on data from the expected time window?
Step 3 done → Ask: do any answers contain null-driven anomalies or suspicious outliers?
Step 4 done → Ask: do all queries return within your performance target?
Step 5 done → Run the full 10-question validation suite. Sign off only when all answers match known-correct results.
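
The validation suite can be kept as a small harness that is rerun after every step. This is a sketch under stated assumptions: ask_agent is a hypothetical callable wrapping however you query your agent, the expected answers are made-up values, and exact string matching is a simplification that real answers may need to relax.

```python
from typing import Callable, Dict, List


def run_validation(ask_agent: Callable[[str], str],
                   expected: Dict[str, str]) -> List[str]:
    """Return the questions whose agent answer differs from the known-correct one."""
    failures = []
    for question, correct in expected.items():
        answer = ask_agent(question)
        if answer.strip().lower() != correct.strip().lower():
            failures.append(question)
    return failures


# Known-correct answers, taken from a trusted source (illustrative values only).
expected = {
    "What's our Q1 revenue?": "1,200,000 USD",
    "Which vendors have open POs above $50K?": "Acme, Globex",
}

# Stand-in for the real agent so the example runs on its own.
canned = {
    "What's our Q1 revenue?": "1,200,000 USD",
    "Which vendors have open POs above $50K?": "Acme",
}


def fake_agent(question: str) -> str:
    return canned.get(question, "")


failures = run_validation(fake_agent, expected)
print("sign off" if not failures else f"blocked by: {failures}")
```

An empty failure list is the sign-off condition; anything else goes back to the step that owns the defect.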
