SERVICE // 03
Data Cleanup & AI-Ready Data.
Your AI project's biggest risk isn't the model - it's the data you're feeding it. We dedupe, normalize, label, and structure that data so the rest of the stack has something useful to work with.
FROM €2,500 · 2-6 WEEKS · PIPELINE INCLUDED
Why this matters.
Most AI pilots stall on data quality, not model choice. A frontier model fed messy data is far more likely to hallucinate. A mid-tier model on clean, well-structured data will often quietly outperform it. We do the unglamorous step first.
What we'll clean.
- CRM data - duplicate contacts, inconsistent company names, broken relationships, stale records.
- Product catalogs - SKU duplication, inconsistent attributes, category drift, metadata gaps.
- Document dumps - OCR, deduplication, extraction into structured fields, metadata tagging.
- Operational logs + spreadsheets - normalization into a queryable warehouse table.
- Labels + training data - consistent labeling, with inter-rater agreement measured as part of the process.
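As an illustration of what measuring inter-rater agreement can look like, here is a minimal sketch of Cohen's kappa for two annotators. The statistic is our choice for the example, not a fixed part of every engagement, and the function name and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sample: two reviewers labeling the same eight records.
annotator_1 = ["spam", "spam", "ok", "ok", "spam", "ok", "ok", "ok"]
annotator_2 = ["spam", "ok", "ok", "ok", "spam", "ok", "spam", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 means the annotators agree far more than chance would predict; a value near 0 suggests the labeling guidelines need another pass before the labels are used for training.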
Method.
- Sample audit. Pull 1,000 rows (or 100 docs). Profile quality. Quantify the mess.
- Cleaning spec. Written rulebook - what's a duplicate, what's canonical, what merges, what's thrown out. You approve.
- Tooling. dbt, Python scripts, off-the-shelf MDM, LLM-based deduplication. Whatever fits.
- Human-in-the-loop review. Near-duplicates and ambiguous merges go through a reviewer queue. No silent data loss.
- Pipeline. Not a one-time clean - an ongoing pipeline so next month's data stays clean.
- Handover. Pipeline docs, owner training, monitoring dashboard.
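The human-in-the-loop step above can be sketched roughly as follows, using only Python's standard library. The two thresholds, the `normalize` helper, and the sample records are hypothetical; in practice the cutoffs and matching rules would come from the approved cleaning spec, and the matcher might be an MDM tool or an LLM rather than `difflib`.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds; real values belong in the cleaning spec.
AUTO_MERGE = 0.95  # at or above this, treat the pair as the same entity
REVIEW = 0.70      # between REVIEW and AUTO_MERGE, a human decides


def normalize(name):
    """Lowercase, drop commas and periods, collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def triage(records):
    """Split candidate pairs into auto-merges and a reviewer queue."""
    auto_merge, review_queue = [], []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = SequenceMatcher(
            None, normalize(name_a), normalize(name_b)
        ).ratio()
        if score >= AUTO_MERGE:
            auto_merge.append((id_a, id_b, score))
        elif score >= REVIEW:
            # Ambiguous pair: queued for review, never merged silently.
            review_queue.append((id_a, id_b, score))
    return auto_merge, review_queue


# Hypothetical CRM rows: (record id, company name).
records = [
    (1, "Acme Corp."),
    (2, "ACME Corp"),
    (3, "Acme Corporation"),
    (4, "Globex Ltd"),
]
auto_merge, review_queue = triage(records)
print("auto-merge:", [(a, b) for a, b, _ in auto_merge])
print("needs review:", [(a, b) for a, b, _ in review_queue])
```

Comparing every pair is fine for a 1,000-row audit sample but scales quadratically, so a production pipeline would normally block candidates first (for example by domain or postcode) before scoring.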
Price.
- Focused cleanup - €2,500-€5,000. One source, one domain. 2 weeks.
- Multi-source - €5,000-€10,000. 2-3 sources, full pipeline. 3-4 weeks.
- Enterprise / MDM-class - €10,000+. Quoted. 5-6+ weeks.
What "AI-ready" means.
- Unique. No duplicates left that the system can't resolve.
- Normalized. One format per field (dates, currencies, countries, product units).
- Complete on critical fields. Or explicitly flagged as missing.
- Structured. Free text parsed into named fields where it matters.
- Labeled. For ML systems, labels exist and are consistent.
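Several of these criteria can be checked mechanically. The sketch below covers three of them (unique, normalized, complete) under assumptions of our own: the field names, the choice of `email` as the uniqueness key, and ISO `YYYY-MM-DD` as the single date format are all hypothetical and would be set per project.

```python
from datetime import datetime

# Hypothetical critical fields; the real list comes from the cleaning spec.
CRITICAL_FIELDS = ("email", "country", "signup_date")


def is_iso_date(value):
    """True if value parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def readiness_report(rows, key="email"):
    """Count duplicate keys, incomplete rows, and non-ISO dates."""
    seen = set()
    duplicates = missing = bad_dates = 0
    for row in rows:
        # Unique: compare keys case-insensitively, ignoring stray spaces.
        k = (row.get(key) or "").strip().lower()
        if k:
            if k in seen:
                duplicates += 1
            seen.add(k)
        # Complete: every critical field must hold a non-empty value.
        if any(not row.get(field) for field in CRITICAL_FIELDS):
            missing += 1
        # Normalized: a date that is present must use the one agreed format.
        date = row.get("signup_date")
        if date and not is_iso_date(date):
            bad_dates += 1
    return {
        "rows": len(rows),
        "duplicate_keys": duplicates,
        "rows_missing_critical": missing,
        "non_iso_dates": bad_dates,
    }


# Hypothetical sample rows.
rows = [
    {"email": "a@example.com", "country": "DE", "signup_date": "2024-01-15"},
    {"email": "A@example.com ", "country": "DE", "signup_date": "15/01/2024"},
    {"email": "b@example.com", "country": "", "signup_date": "2024-02-01"},
    {"email": None, "country": "FR", "signup_date": None},
]
report = readiness_report(rows)
print(report)
```

Counts like these are the kind of numbers a monitoring dashboard can track over time, so a regression in next month's data shows up as a rising count rather than a surprise downstream.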