SERVICE // 03

Data Cleanup & AI-Ready Data.

Your AI project's biggest risk isn't the model - it's the data you're feeding it. We dedupe, normalize, label, and structure so the rest of the stack has something useful to work with.

FROM €2,500 · 2-6 WEEKS · PIPELINE INCLUDED

Why this matters.

Most AI pilots stall on data quality, not model choice. A frontier model fed messy, contradictory data is far more likely to hallucinate. A mid-tier model on clean, well-structured data will often outperform it. We do the unglamorous step first.

What we'll clean.

  • CRM data - duplicate contacts, inconsistent company names, broken relationships, stale records.
  • Product catalogs - SKU duplication, inconsistent attributes, category drift, metadata gaps.
  • Document dumps - OCR, deduplication, extraction into structured fields, metadata tagging.
  • Operational logs + spreadsheets - normalization into a queryable warehouse table.
  • Labels + training data - consistent labeling guidelines, with inter-rater agreement measured rather than assumed.
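
To make the inter-rater agreement point concrete: one common way to quantify it is Cohen's kappa, which corrects raw agreement between two labelers for the agreement you'd expect by chance. The sketch below is illustrative only; the function name and the sample labels are made up for this example, and a real project might use a different statistic (for instance when there are more than two raters).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Share of items on which the two raters picked the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sample: two reviewers labeling the same four support tickets.
rater_1 = ["spam", "spam", "ham", "ham"]
rater_2 = ["spam", "ham", "ham", "ham"]
kappa = cohens_kappa(rater_1, rater_2)
```

A kappa near 1 means the labelers agree well beyond chance; a value near 0 means the labeling guidelines probably need tightening before the labels are used for training.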

Method.

  1. Sample audit. Pull 1,000 rows (or 100 docs). Profile quality. Quantify the mess.
  2. Cleaning spec. Written rulebook - what's a duplicate, what's canonical, what merges, what's thrown out. You approve.
  3. Tooling. dbt, Python scripts, off-the-shelf MDM, LLM-based deduplication. Whatever fits.
  4. Human-in-the-loop review. Near-duplicates and ambiguous merges go through a reviewer queue. No silent data loss.
  5. Pipeline. Not a one-time clean - an ongoing pipeline so next month's data stays clean.
  6. Handover. Pipeline docs, owner training, monitoring dashboard.
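
As a rough illustration of steps 2-4, here is a minimal sketch of how a cleaning rule can route record pairs: exact matches after normalization merge automatically, near-duplicates land in a reviewer queue, and everything else is left alone. The thresholds, the normalize() rule, and the sample company names are assumptions made up for this example; in a real engagement they come from the approved cleaning spec, and the matching itself may be done with dbt, an MDM tool, or an LLM rather than Python's difflib.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds; real values would be set in the cleaning spec.
AUTO_MERGE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.80


def normalize(name):
    """Lowercase, drop dots and commas, collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def triage(names):
    """Compare every pair of names and sort pairs into auto-merge vs. human review."""
    auto_merge, review_queue = [], []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= AUTO_MERGE_THRESHOLD:
            auto_merge.append((a, b))
        elif score >= REVIEW_THRESHOLD:
            # Ambiguous: a person decides, nothing is merged silently.
            review_queue.append((a, b, round(score, 2)))
    return auto_merge, review_queue


# Made-up sample of CRM company names.
companies = ["Acme Corp.", "ACME Corp", "Acme Corp Ltd", "Globex"]
auto_merge, review_queue = triage(companies)
```

In this sample, "Acme Corp." and "ACME Corp" normalize to the same string and merge automatically, while the pairs involving "Acme Corp Ltd" score in between the two thresholds and go to a reviewer. Pairwise comparison like this only suits small samples; larger datasets need blocking or indexing first.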

Price.

  • Focused cleanup - €2,500-€5,000. One source, one domain. 2 weeks.
  • Multi-source - €5,000-€10,000. 2-3 sources, full pipeline. 3-4 weeks.
  • Enterprise / MDM-class - €10,000+. Quoted. 5-6+ weeks.

What "AI-ready" means.

  1. Unique. Each real-world entity appears once, or its duplicates are linked so the system can resolve them.
  2. Normalized. One format per field (dates, currencies, countries, product units).
  3. Complete on critical fields. Or explicitly flagged missing.
  4. Structured. Free text parsed into named fields where it matters.
  5. Labeled. For ML systems, labels exist and are consistent.
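
The first three criteria can be measured rather than asserted. Below is a minimal sketch of the kind of check a sample audit might run: count duplicate keys, count values that miss the agreed format, and count blanks on critical fields. The field names (email, country, signup), the choice of ISO dates as the canonical format, and the sample rows are all hypothetical.

```python
import re
from collections import Counter

# Assumed canonical date format for this example: YYYY-MM-DD.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def readiness_report(rows, key, critical_fields, date_field):
    """Profile a list of dict records against three of the AI-ready criteria."""
    # Unique: extra occurrences of a key value beyond the first count as duplicates.
    key_counts = Counter(r[key] for r in rows if r.get(key))
    duplicates = sum(count - 1 for count in key_counts.values() if count > 1)

    # Normalized: non-empty dates that do not match the canonical format.
    bad_dates = sum(
        1 for r in rows
        if r.get(date_field) and not ISO_DATE.match(r[date_field])
    )

    # Complete: blank or absent values per critical field.
    missing = {
        field: sum(1 for r in rows if r.get(field) in (None, ""))
        for field in critical_fields
    }

    return {
        "rows": len(rows),
        "duplicate_keys": duplicates,
        "non_canonical_dates": bad_dates,
        "missing": missing,
    }


# Made-up sample records.
sample = [
    {"email": "a@example.com", "country": "DE", "signup": "2024-01-05"},
    {"email": "a@example.com", "country": "", "signup": "05/01/2024"},
    {"email": "b@example.com", "country": "FR", "signup": "2024-02-10"},
]
report = readiness_report(sample, "email", ["email", "country"], "signup")
```

A report like this gives each criterion a number to track, so "clean" becomes something the pipeline can monitor month over month.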