The Curator
The Curator is an automated pipeline that takes any dataset and returns structured quality scores, rich metadata, and a plain-language report - so your data is documented, tested, and ready to use.
The 13 steps:
1. Dataset intake
Loads the dataset, detects its file format, and maps out the structure — what columns exist, what types they are, and which can hold empty values.
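A minimal sketch of the structure-mapping part of intake, assuming pandas; `describe_structure` is a hypothetical helper, and the real step also handles file-format detection:

```python
import pandas as pd

def describe_structure(df: pd.DataFrame) -> dict:
    """Map out the structure: each column's name, dtype,
    and whether it currently holds empty values."""
    return {
        col: {
            "dtype": str(df[col].dtype),
            "has_nulls": bool(df[col].isna().any()),
        }
        for col in df.columns
    }
```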
2. Sample generation
Creates a consistent snapshot of the dataset for faster analysis. Using the same slice every time means results are reproducible across runs.
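Reproducibility comes from sampling with a fixed seed; a sketch assuming pandas, with the seed value itself an arbitrary choice:

```python
import pandas as pd

SAMPLE_SEED = 42  # fixed seed so the same slice is drawn every run (assumed value)

def make_sample(df: pd.DataFrame, n: int = 1000) -> pd.DataFrame:
    """Return a reproducible sample; identical input yields identical rows."""
    if len(df) <= n:
        return df.copy()
    return df.sample(n=n, random_state=SAMPLE_SEED)
```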
3. Structural analysis
Measures how complete the data is - scanning every column for missing values and producing a completeness score for each one and for the dataset overall.
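The completeness calculation reduces to the share of non-missing cells; a sketch assuming pandas:

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> dict:
    """Share of non-missing values per column, plus one score for the dataset."""
    per_column = (1 - df.isna().mean()).round(4).to_dict()
    overall = round(1 - df.isna().values.mean(), 4)
    return {"columns": per_column, "overall": overall}
```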
4. Validity checks
Validates that values match their expected type and format (dates look like dates, emails look like emails, numbers fall in sensible ranges). Also optionally flags columns that may contain personal data.
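Format validation can be sketched as regex rules applied per column; the two patterns below are illustrative, not the pipeline's actual rule set:

```python
import re
import pandas as pd

PATTERNS = {  # illustrative rules only
    "email": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
    "iso_date": r"^\d{4}-\d{2}-\d{2}$",
}

def validity_rate(series: pd.Series, rule: str) -> float:
    """Fraction of non-null values matching the expected format."""
    pattern = re.compile(PATTERNS[rule])
    values = series.dropna().astype(str)
    if values.empty:
        return 1.0
    return float(values.map(lambda v: bool(pattern.match(v))).mean())
```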
5. Uniqueness analysis
Finds duplicate rows and identifies which columns could act as unique identifiers, based on how many distinct values they contain and whether they follow naming conventions like "id" or "key".
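Both checks are cheap with pandas; the 0.95 distinctness cutoff below is an assumed value, not the pipeline's:

```python
import pandas as pd

def key_candidates(df: pd.DataFrame) -> list:
    """Columns that are fully distinct, or nearly distinct with a key-like name."""
    candidates = []
    for col in df.columns:
        distinct_ratio = df[col].nunique() / len(df)
        looks_like_key = col.lower().endswith(("id", "key"))
        if distinct_ratio == 1.0 or (looks_like_key and distinct_ratio > 0.95):
            candidates.append(col)
    return candidates

def duplicate_count(df: pd.DataFrame) -> int:
    """Number of rows that repeat an earlier row exactly."""
    return int(df.duplicated().sum())
```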
6. Distribution analysis
Runs statistical analysis on numeric columns - averages, spread, outlier detection, and distribution shape - to understand whether values behave as expected.
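A sketch of a per-column numeric profile using the standard 1.5×IQR outlier rule (the pipeline's actual outlier method is not specified):

```python
import pandas as pd

def numeric_profile(series: pd.Series) -> dict:
    """Mean, spread, and IQR-based outlier count for one numeric column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = series[(series < lo) | (series > hi)]
    return {
        "mean": float(series.mean()),
        "std": float(series.std()),
        "outliers": int(outliers.count()),
    }
```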
7. Column role detection
Classifies each column by its role - identifier, measure, dimension, or time field - and assigns a semantic type where relevant, such as currency, location coordinate, or postal code.
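A rough heuristic version of the role classification, far simpler than the real step (which also assigns semantic types):

```python
import pandas as pd

def detect_role(df: pd.DataFrame, col: str) -> str:
    """Assign one of: time, identifier, measure, dimension."""
    s = df[col]
    if pd.api.types.is_datetime64_any_dtype(s):
        return "time"
    if s.notna().all() and s.nunique() == len(s):
        return "identifier"
    if pd.api.types.is_numeric_dtype(s):
        return "measure"
    return "dimension"
```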
8. Correlation analysis
Identifies relationships between columns using statistical and machine learning methods. Flags pairs that are so strongly linked they may be redundant, and surfaces non-obvious dependencies.
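The redundancy flag can be sketched as a pairwise correlation scan; the 0.95 threshold is an assumed cutoff:

```python
import pandas as pd

def redundant_pairs(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Numeric column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j]))
    return pairs
```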
9. Detailed profiling
Goes column by column to document value frequencies, how spread out the data is, and whether values follow recognisable patterns like date formats or reference codes.
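Value frequencies and pattern recognition can be sketched together; the letter/digit "shape" encoding below is one crude way to surface recognisable formats:

```python
import re
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Top value frequencies plus shape patterns (A = letter, 9 = digit)."""
    def shape(v: str) -> str:
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))
    values = series.dropna().astype(str)
    return {
        "top_values": values.value_counts().head(3).to_dict(),
        "patterns": values.map(shape).value_counts().to_dict(),
    }
```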
10. Missing value patterns
Investigates why data is missing - whether it appears random, related to other columns, or tied to specific categories. This shapes the right approach to handling gaps before modelling or analysis.
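One simple probe for non-random missingness: compare the missing rate of a column across categories of another column. A large gap between groups suggests the gaps are tied to those categories rather than random:

```python
import pandas as pd

def missingness_by_group(df: pd.DataFrame, target: str, by: str) -> dict:
    """Missing rate of `target` within each category of `by`."""
    return df[target].isna().groupby(df[by]).mean().to_dict()
```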
11. Metadata generation
Compiles all results into a machine-readable metadata file and a human-readable report, including a full data dictionary for every column.
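The machine-readable half of this step is plain serialisation; a sketch using JSON (the actual file format is not specified):

```python
import json

def write_metadata(results: dict, path: str) -> None:
    """Serialise the compiled results to a machine-readable JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, sort_keys=True)
```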
12. LLM cataloguing
Uses a language model to write natural language descriptions of the dataset, suggest relevant tags, and propose potential use cases - making the dataset easier to discover and understand.
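The model-facing side of this step amounts to assembling a prompt from the profiling results; the wording and structure below are purely illustrative:

```python
def build_catalogue_prompt(name: str, schema: dict, sample_rows: list) -> str:
    """Assemble a cataloguing prompt from dataset name, schema, and samples."""
    cols = ", ".join(f"{c} ({t})" for c, t in schema.items())
    return (
        f"Dataset: {name}\n"
        f"Columns: {cols}\n"
        f"Sample rows: {sample_rows}\n"
        "Write a one-paragraph description, suggest five tags, "
        "and propose three potential use cases."
    )
```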
13. Storage and archiving
Archives every output - the raw file, sample, metadata, report, and quality scores - to cloud or local storage, organised by owner, dataset, and version.
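The owner/dataset/version organisation maps naturally onto a storage key; a sketch of one plausible layout (the exact path scheme is assumed):

```python
from pathlib import PurePosixPath

def artifact_path(owner: str, dataset: str, version: str, artifact: str) -> str:
    """Build a storage key organised by owner, dataset, and version."""
    return str(PurePosixPath(owner) / dataset / version / artifact)
```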
How quality is measured
Every dataset exits the Curator with an overall quality score built from four dimensions. Each one captures a different aspect of whether data can be trusted.
Completeness
How much of the data is actually present, across every column.
Validity
Whether values match their expected types, formats, and rules.
Uniqueness
The absence of duplicate records that would distort analysis.
Consistency
Whether numeric values are stable, with few outliers and predictable spread.
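Combining the four dimension scores into the overall score can be sketched as a simple average; equal weights are an assumption, since the actual weighting is not specified:

```python
def overall_quality(completeness: float, validity: float,
                    uniqueness: float, consistency: float) -> float:
    """Equal-weight average of the four dimension scores (weights assumed)."""
    scores = [completeness, validity, uniqueness, consistency]
    return round(sum(scores) / len(scores), 4)
```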
What gets produced
Every run produces five versioned artifacts - the raw file, the sample, the metadata file, the report, and the quality scores - stored alongside each other so you always know exactly what was assessed and when.
Why it matters
Most data quality work is done once, by hand, and never written down. The Curator makes that process automatic, repeatable, and versioned. Every time a dataset changes, the pipeline reruns and a fresh set of scored artifacts is saved alongside the previous version.
The result is a living audit trail. You can track quality over time, catch problems before they reach downstream models, and hand any stakeholder a plain-language report without writing a single query.