PLAYBOOK // DATA

Data cleanup as AI readiness work: why data beats models

The unglamorous step that decides whether your AI project ships. A practical cleaning checklist for SMBs.

2026-04-17 ·7 min read ·AISO-DEV dataplaybookai-readiness

TL;DR

  • Most SMB AI pilots stall on data quality, not model choice.
  • Five tests decide whether data is AI-ready: unique, normalized, complete-on-critical, structured, labeled.
  • Ship the pipeline, not just a one-time clean.
  • Tooling: dbt / Python / LLM-assisted dedup / human-in-the-loop review queue, usually a mix.
  • Budget 2-6 weeks for a first domain.

The thesis

Pick any failed AI pilot. Nine times in ten, the failure trace looks like this:

  1. Team plugs an LLM into the CRM.
  2. Model hallucinates. Retrieval misses. Output looks wrong to the sales team.
  3. Team blames the model.
  4. Team tries a bigger model. Same problem.
  5. Project sunsets “until AI gets better.”

What actually happened: the CRM has 4,000 duplicate contacts under six variations of company name, two fields labeled “status,” free-text in the country column, and a notes field used as a garbage dump for seven years.

No model fixes that. This is why we treat data cleanup for AI as a first-class engagement, not a phase you sneak in around model selection.

Five tests your data has to pass

1. Unique

No duplicates the system can’t resolve. “Acme Corp”, “Acme Corporation”, “ACME CORP.”, and “acme inc” are not four customers. If your system can’t collapse them, your agent can’t either.

How to check: fuzzy-dedupe a sample. If more than 5% of records have probable duplicates, you have a problem.

2. Normalized

One format per field. Dates in ISO. Currencies in currency codes. Countries in ISO-3166. Product units normalized (kg not “kilos”/“kilograms”). Company names stripped of “Inc/Corp/Ltd” variants.

How to check: for each field, count distinct values and scan the outliers. If “United States” appears as 8 variants, you have a problem.

3. Complete on critical fields

Not every field has to be filled. But the fields that matter for the task must be - or flagged explicitly missing. For a sales agent: company name, contact email, last interaction date. Missing values should be NULL, never empty strings or “N/A”.

How to check: per-field completeness report on the fields your task needs. Under 85% on a critical field is a fire.

4. Structured where it matters

Free text is fine for notes. Not fine for address. Not fine for role. Not fine for industry. If your task needs to filter or classify on a field, it needs to be structured.

How to check: if your task starts with “the model should figure out the country from the notes field” - the data isn’t structured enough.

5. Labeled where you need it

If you’re running supervised classification or building retrieval with any kind of type tag, the labels have to exist and be consistent. Inter-rater agreement should be at least 0.8 on anything ambiguous.

How to check: sample 100 rows. Have two people label them independently. Compute agreement.

A cleaning pass, start to finish

Here’s the shape of a typical 3-week cleanup we’d scope. We usually pair this with an AI readiness audit so the cleaning rulebook ties back to the downstream agent or feature it is meant to feed.

Week 1 - Sample audit + cleaning spec

  • Pull 1,000 rows (or 100 docs) representative of the production set.
  • Profile each field: nulls, distinct values, format variants, outliers.
  • Write the cleaning rulebook: canonical forms, merge rules, dedup thresholds, what gets thrown out, what gets flagged.
  • Review with the data owner. They sign off before a single row is touched.

Week 2 - Build + dry-run

  • Write the pipeline. Usually: extract → dedupe → normalize → enrich → validate → load.
  • Dry-run on a staging copy. Manually review the diff for the first few hundred merges.
  • Set up a human-in-the-loop queue for ambiguous merges - never silently merge records.
  • Tune thresholds. Ship.

Week 3 - Production + monitoring

  • Run the pipeline on production (with a rollback plan).
  • Ship the monitoring dashboard: records per day, dedup rate, null rate, out-of-bounds rate on critical fields.
  • Handover runbook + train the data owner.

The tooling we reach for

  • SQL + dbt for transformations. Testable, versioned, reviewable. The boring choice and the right one for 80% of work.
  • Python + pandas for messy one-off work. dbt can’t do fuzzy dedup well; Python can.
  • LLM-assisted dedup where fuzzy matching on names, addresses, or free-text fields is the hard part. Structured output, confidence threshold, human review on edges.
  • Cleanlab or similar for label quality analysis on supervised sets.
  • Metabase / Superset for the monitoring dashboard. Cheap, self-hostable, honest.
  • Human review UI - same pattern as agents (see the human-in-the-loop playbook). Ambiguous merges go to a queue. A human approves, edits, or rejects.

Five anti-patterns we keep seeing

  1. “AI will clean the data for us.” Sometimes, partially. Mostly no. LLMs are good at fuzzy-matching company names; they’re bad at making a business decision about which of three duplicate customer records is the canonical one. That’s a rule-plus-review problem.
  2. One-time clean, no pipeline. Next month’s data arrives dirty. You’re back to square one. Ship the pipeline.
  3. No monitoring. You don’t know when quality slips until the agent starts hallucinating. Ship the dashboard.
  4. Cleaning in production without a rollback. Back up first. Always.
  5. Skipping the rulebook. If the cleaning rules aren’t written down, you can’t defend the merges when sales asks why Acme Corp lost its logo.

What “AI-ready” means in practice

A data domain is AI-ready when:

  • You can describe its canonical schema in one page.
  • Completeness on critical fields is >85%.
  • Duplicate rate is <2%.
  • There’s a pipeline keeping it that way.
  • The monitoring dashboard is visible to whoever owns the domain.

If you can’t say yes to all five, model choice won’t save you.

What this usually unlocks

Clients who do the cleanup first:

  • Stop blaming the model.
  • Cut LLM token spend by 20-50% (shorter, cleaner context).
  • Ship downstream agents in weeks, not quarters.
  • Find they don’t need the frontier model for most tasks - a cheaper model on clean data works.

Data cleanup isn’t AI-adjacent. It’s AI’s foundation. Skip this step and even the best custom AI development work runs on noise.


Data domain that needs attention? Scope a data project →

Want us to tell you what’s worth cleaning first? Free AI Readiness Audit →

BUILD // NEXT

Got a project that should be a build log?

Scope a project. We'll ship it. You get a system; we may write about it (with permission).