
Data Readiness Checklist for a Data Agent

7 min read · Published March 2026 · Data Engineering · AI Agents · Fabric Copilot
TL;DR: At a Glance
5 Steps to agent-ready data
3 Source types to inventory
1 Checklist, covers all agent scenarios
Work through the steps in order and test your agent on a data subset after each one. Skip a step and you'll be chasing errors after deployment rather than before it.

Before your agent can trust your data

A data agent (whether it's Fabric Copilot, a custom AI pipeline, or an automated reporting layer) is only as reliable as the data it runs on. Most deployment failures aren't model failures. They're data failures: missing fields, stale records, ambiguous column names, or tables that take 45 seconds to scan.

This checklist helps you get your data into a state an agent can query, analyze, and act on reliably. Work through one step at a time and document your fixes as you go. Validate each step with SQL, data profilers, or checks built into your ETL pipelines.

Step 1: Inventory Data Sources and Identify Gaps

List every relevant source your agent may need to answer questions. Group them by type:

  • Structured (SQL databases): sales records, ERP transaction tables, financial ledgers, vendor masters
  • Semi-structured (JSON / event logs): Event Hub streams, API payloads, audit logs, webhook outputs
  • Unstructured (documents / blobs): PDFs in blob storage, email attachments, SharePoint files, contracts

For each key question the agent might handle ("What's our Q1 revenue?", "Which vendors have open POs above $50K?"), confirm that the data exists, is accessible, and is mapped to a table or file location.

Action
Build a simple question-to-source mapping table. Example: sales data is in the warehouse but lacks real-time updates → flag for streaming integration. Document gaps separately as sourcing backlog items.
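The mapping table can live in a spreadsheet, but keeping it in code makes the gaps easy to list. The following is a minimal sketch, not a prescribed format: the `SourceMapping` fields, the `sourcing_backlog` helper, and the table name `warehouse.sales` are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceMapping:
    """One row of the question-to-source mapping table (hypothetical shape)."""
    question: str
    source_type: str                 # "structured", "semi-structured", "unstructured"
    location: Optional[str]          # table or file path; None if no source exists yet
    accessible: bool = False
    gaps: List[str] = field(default_factory=list)


def sourcing_backlog(mappings):
    """Return one backlog line per missing source, access problem, or noted gap."""
    backlog = []
    for m in mappings:
        if m.location is None:
            backlog.append(f"{m.question}: no source identified")
            continue
        if not m.accessible:
            backlog.append(f"{m.question}: {m.location} is not accessible to the agent")
        for gap in m.gaps:
            backlog.append(f"{m.question}: {gap}")
    return backlog


# Illustrative entries only; the location names are made up.
MAPPINGS = [
    SourceMapping(
        "What's our Q1 revenue?", "structured", "warehouse.sales",
        accessible=True,
        gaps=["lacks real-time updates; flag for streaming integration"],
    ),
    SourceMapping("Which vendors have open POs above $50K?", "structured", None),
]

for line in sourcing_backlog(MAPPINGS):
    print(line)
```

Anything the helper returns goes on the sourcing backlog; an empty list means every question has a mapped, reachable source.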

Step 2: Verify Data Freshness

Determine the latency your use case requires. Real-time stock levels need streaming. Daily financial reports may tolerate a scheduled ETL batch. A mismatch between what the agent assumes and how often the data actually refreshes produces wrong answers, delivered confidently.

1
Query last-update timestamps
For each key table, run a freshness check and compare to required latency.
SELECT MAX(update_date) FROM sales_table;
2
Fix staleness with the right pipeline type
Streaming requirement → Apache Kafka or Fabric Eventstream. Daily batch → scheduled ETL or Fabric Data Factory pipeline.
Action
Example: inventory data lags 24 hours but the agent answers "current stock level" questions. Implement an hourly trigger refresh or switch to a streaming source. Document confirmed latency SLAs per table.
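The freshness check above can be automated by comparing each table's latest timestamp against its SLA. This is a sketch under stated assumptions: it uses an in-memory SQLite database as a stand-in for the real warehouse, the `FRESHNESS_SLA` values and the `inventory` table are hypothetical, and timestamps are assumed to be stored as ISO-format text.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical latency SLAs per table; replace with your documented values.
# Table names come from this fixed dict, never from user input.
FRESHNESS_SLA = {
    "sales_table": timedelta(hours=24),
    "inventory": timedelta(hours=1),
}


def check_freshness(conn, sla, now):
    """Return {table: (age, is_fresh)} by comparing MAX(update_date) to each SLA."""
    report = {}
    for table, max_age in sla.items():
        row = conn.execute(f"SELECT MAX(update_date) FROM {table}").fetchone()
        if row[0] is None:           # empty table: treat as stale
            report[table] = (None, False)
            continue
        age = now - datetime.fromisoformat(row[0])
        report[table] = (age, age <= max_age)
    return report


# Demo rows are made up so the example runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_table (id INTEGER, update_date TEXT)")
conn.execute("CREATE TABLE inventory (id INTEGER, update_date TEXT)")
conn.execute("INSERT INTO sales_table VALUES (1, '2026-03-01 06:00:00')")
conn.execute("INSERT INTO inventory VALUES (1, '2026-02-28 12:00:00')")

now = datetime(2026, 3, 1, 12, 0, 0)  # fixed clock keeps the demo deterministic
for table, (age, fresh) in check_freshness(conn, FRESHNESS_SLA, now).items():
    print(f"{table}: age={age}, {'OK' if fresh else 'STALE'}")
```

In the demo, the inventory table is 24 hours old against a 1-hour SLA, so it is reported as stale, which mirrors the inventory example in the Action note.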

Step 3: Assess Data Quality

An agent will not detect bad data; it will answer with it. Profile every table the agent will touch before it goes live.

1
Profile for nulls and duplicates
In Python (pandas): df.isnull().sum() for nulls, df.duplicated().sum() for dupes. In SQL:
SELECT COUNT(*) - COUNT(email) AS null_emails, COUNT(*) - COUNT(DISTINCT vendor_id) AS dupe_vendors FROM vendors;
2
Validate schema and column naming
Rename ambiguous fields so the agent can reason about them. CustID → CustomerID. Amt → InvoiceAmount_USD. Check for type mismatches: dates stored as strings, amounts stored as VARCHAR.
3
Clean and document
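Deduplicate by keeping the first row per natural key. The statement below relies on rowid, which exists in SQLite and Oracle; other engines need a ROW_NUMBER() window or a surrogate key instead:
DELETE FROM vendors WHERE rowid NOT IN ( SELECT MIN(rowid) FROM vendors GROUP BY vendor_name, tax_id );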
Action
Generate a per-table quality report (null rates, duplicate rates, type issues). If 10% of email fields are null in a customer table, script a fill with defaults or exclude the field from agent context entirely.
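A per-table quality report can be generated with the same COUNT-based SQL shown in the profiling step. The sketch below assumes SQLite as a stand-in engine; `quality_report` is a hypothetical helper, and the vendor rows are invented for the demo.

```python
import sqlite3


def quality_report(conn, table, null_columns, key_columns):
    """Return row count, per-column null rate, and surplus duplicate rows.

    Identifiers are interpolated into SQL, so pass only trusted names.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    null_rates = {}
    for col in null_columns:
        # COUNT(col) skips NULLs, so the difference is the NULL count.
        nulls = conn.execute(
            f"SELECT COUNT(*) - COUNT({col}) FROM {table}"
        ).fetchone()[0]
        null_rates[col] = nulls / total if total else 0.0
    keys = ", ".join(key_columns)
    # One group per distinct key; rows beyond the first in a group are duplicates.
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT 1 FROM {table} GROUP BY {keys})"
    ).fetchone()[0]
    return {"rows": total, "null_rates": null_rates, "duplicate_rows": total - distinct}


# Made-up sample data: one NULL email, two duplicated vendors.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendors (vendor_id INTEGER, vendor_name TEXT, tax_id TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO vendors VALUES (?, ?, ?, ?)",
    [
        (1, "Acme", "T1", "a@acme.test"),
        (2, "Acme", "T1", None),
        (3, "Birch", "T2", "b@birch.test"),
        (4, "Cedar", "T3", "c@cedar.test"),
        (5, "Cedar", "T3", "c@cedar.test"),
    ],
)

report = quality_report(conn, "vendors", ["email"], ["vendor_name", "tax_id"])
print(report)
```

Comparing `null_rates` against a threshold you choose (the 10% figure in the Action note, for example) tells you which fields to fix or to keep out of the agent's context.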

Step 4: Evaluate Volume and Performance

A query that takes 30 seconds in a test environment will time out or frustrate users in a live agent. At scale (billions of rows, unpartitioned tables) even simple aggregations become slow.

1
Measure row counts and query times
Use EXPLAIN to identify full table scans before they become production problems.
EXPLAIN SELECT * FROM large_table WHERE date > '2024-01-01';
2
Partition by date or range for large tables
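For a 5B-row log table with 30s query times, add range partitioning. The syntax is engine-specific. MySQL, for example, requires the partitions to be listed explicitly (the boundaries below are illustrative), while PostgreSQL cannot convert an existing table and needs a new partitioned table that the data is moved into:
ALTER TABLE logs PARTITION BY RANGE (TO_DAYS(log_date)) ( PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')), PARTITION pmax VALUES LESS THAN MAXVALUE );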
3
Index key filter columns
Index columns the agent will filter on: transaction date, vendor ID, entity code, cost center. In Fabric / Delta Lake, use Z-ORDER clustering on high-cardinality filter columns.
Action
Set a query performance target before deployment: e.g., all agent queries must return in under 3 seconds. Profile the five most common agent queries against that target and optimize until met.
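Checking queries against the target can be scripted. The sketch below is one possible shape, assuming SQLite as a stand-in: `profile_queries` is a hypothetical helper, the `logs` table and its rows are generated for the demo, and the scan detection relies on SQLite's `EXPLAIN QUERY PLAN` wording ("SCAN" versus "SEARCH"), which differs in other engines.

```python
import sqlite3
import time
from datetime import date, timedelta

TARGET_SECONDS = 3.0  # the example target from the text; set your own


def profile_queries(conn, queries, target=TARGET_SECONDS):
    """Time each query once and flag plans that scan a whole table."""
    results = {}
    for name, sql in queries.items():
        plan = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
        # The last column of each plan row is a text detail, e.g.
        # "SCAN logs" (full scan) or "SEARCH logs USING INDEX ..." (index lookup).
        full_scan = any(str(row[-1]).startswith("SCAN") for row in plan)
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "full_scan": full_scan,
            "within_target": elapsed <= target,
        }
    return results


# Small generated table so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, log_date TEXT, message TEXT)")
first_day = date(2023, 1, 1)
conn.executemany(
    "INSERT INTO logs (log_date, message) VALUES (?, ?)",
    [((first_day + timedelta(days=i)).isoformat(), f"event {i}") for i in range(1000)],
)

QUERIES = {"recent_logs": "SELECT * FROM logs WHERE log_date > '2024-01-01'"}

before = profile_queries(conn, QUERIES)
conn.execute("CREATE INDEX idx_logs_date ON logs (log_date)")
after = profile_queries(conn, QUERIES)
print("full scan before index:", before["recent_logs"]["full_scan"])
print("full scan after index:", after["recent_logs"]["full_scan"])
```

A single timing run is noisy; on real data, repeat each query several times and compare a high percentile against the target rather than one measurement.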

Step 5: Migrate or Integrate Data if Required

If your data is scattered across five systems with no unified access layer, the agent will either fail silently or require complex per-source connectors that become maintenance liabilities. Centralize before you deploy.

1
Move data to a unified lakehouse
OneLake in Microsoft Fabric is the recommended target for Finance + AI workloads. Map migration steps per source, then verify post-move row counts match.
-- Verify post-migration row count
SELECT COUNT(*) FROM new_table;
-- Must match:
SELECT COUNT(*) FROM source_table;
2
Connect real-time external sources
For sources that can't be migrated (e.g., Salesforce CRM, live ERP), use Fabric Data Factory connectors or Power Query Online to create live feeds with scheduled or triggered refresh.
3
Test the agent on a subset before full deployment
Ask the agent a curated set of 10 questions against the migrated data. Validate every answer against a known-correct source. Only proceed to full deployment after all 10 match.
Action
Example: data is in S3. Use a Fabric copy job to OneLake, then verify counts. For Salesforce, use the Fabric Salesforce connector and schedule hourly syncs for pipeline-critical entities like Accounts and Opportunities.
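When many tables are migrated, the row-count comparison is easier to run as a loop. This is a minimal sketch with two in-memory SQLite databases standing in for the source system and the lakehouse; `reconcile_counts` and the table names are hypothetical.

```python
import sqlite3


def reconcile_counts(source_conn, target_conn, table_pairs):
    """Compare COUNT(*) for each (source_table, target_table) pair.

    Returns mismatches as (source_table, target_table, source_rows, target_rows).
    Table names must come from a trusted list, since they are interpolated.
    """
    mismatches = []
    for src, dst in table_pairs:
        src_rows = source_conn.execute(f"SELECT COUNT(*) FROM {src}").fetchone()[0]
        dst_rows = target_conn.execute(f"SELECT COUNT(*) FROM {dst}").fetchone()[0]
        if src_rows != dst_rows:
            mismatches.append((src, dst, src_rows, dst_rows))
    return mismatches


# Demo: the "orders" copy is complete, the "accounts" copy lost one row.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE orders (id INTEGER)")
    db.execute("CREATE TABLE accounts (id INTEGER)")
source.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
target.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
source.executemany("INSERT INTO accounts VALUES (?)", [(1,), (2,)])
target.executemany("INSERT INTO accounts VALUES (?)", [(1,)])

PAIRS = [("orders", "orders"), ("accounts", "accounts")]
for mismatch in reconcile_counts(source, target, PAIRS):
    print("row count mismatch:", mismatch)
```

Matching counts are a necessary check, not a sufficient one: they will not catch rows whose values were altered in transit, so the agent question test that follows still matters.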

After each step: test, iterate, document

Don't complete all five steps and then test. After each step, ask the agent a sample set of questions scoped to that layer of readiness. A freshness issue caught after Step 2 is a 2-hour fix. The same issue discovered post-deployment is a trust problem with your stakeholders.

  • Step 1 done → Ask: can the agent find every data source it needs? Does it error on any question?
  • Step 2 done → Ask: are the answers based on data from the expected time window?
  • Step 3 done → Ask: do any answers contain null-driven anomalies or suspicious outliers?
  • Step 4 done → Ask: do all queries return within your performance target?
  • Step 5 done → Run the full 10-question validation suite. Sign off only when all answers match known-correct results.
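The sign-off check can be written as a small harness that compares agent answers to known-correct values. This is a sketch only: `ask_agent` is whatever callable wraps your agent, `fake_agent` is a stand-in so the example runs, and the questions and figures are invented for illustration.

```python
def run_validation_suite(ask_agent, expected_answers):
    """Ask every question and collect answers that differ from the known-correct value."""
    failures = []
    for question, expected in expected_answers.items():
        actual = ask_agent(question)
        if actual != expected:
            failures.append(
                {"question": question, "expected": expected, "actual": actual}
            )
    return failures


# Known-correct answers taken from a trusted source (values here are made up).
EXPECTED = {
    "How many vendors are active?": 42,
    "How many open POs exceed $50K?": 7,
    "How many invoices were posted yesterday?": 120,
}

# Stand-in for a real agent; it gets one answer wrong on purpose.
CANNED = {
    "How many vendors are active?": 42,
    "How many open POs exceed $50K?": 9,
    "How many invoices were posted yesterday?": 120,
}


def fake_agent(question):
    return CANNED[question]


failures = run_validation_suite(fake_agent, EXPECTED)
print("sign off" if not failures else f"{len(failures)} answer(s) need investigation")
for f in failures:
    print(f)
```

Exact equality suits counts and IDs; for free-text or rounded numeric answers you would swap in a comparison that tolerates formatting differences.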
