The Curator

Updated April 17, 2026

The Curator is an automated pipeline that takes any dataset and returns structured quality scores, rich metadata, and a plain-language report - so your data is documented, tested, and ready to use.

1. Dataset intake

Loads the dataset, detects its file format, and maps out the structure - what columns exist, what types they are, and which can hold empty values.
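A minimal sketch of schema inference, assuming records arrive as Python dicts (the function name and return shape are illustrative, not the Curator's actual API):

```python
from typing import Any

def infer_schema(rows: list[dict[str, Any]]) -> dict[str, dict]:
    """Infer each column's type and whether it can hold empty values."""
    schema: dict[str, dict] = {}
    for col in rows[0]:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v not in (None, "")]
        types = {type(v).__name__ for v in non_null}
        schema[col] = {
            "type": types.pop() if len(types) == 1 else "mixed",
            "nullable": len(non_null) < len(values),
        }
    return schema

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
]
schema = infer_schema(rows)
# "id" is an int column with no gaps; "email" is a nullable string column
```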

2. Sample generation

Creates a consistent snapshot of the dataset for faster analysis. Using the same slice every time means results are reproducible across runs.
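The reproducibility trick here is simply a fixed random seed - the same slice comes back on every run. A sketch (seed value and sample size are illustrative):

```python
import random

def deterministic_sample(rows: list, k: int, seed: int = 42) -> list:
    """Return the same k-row sample on every run, thanks to the fixed seed."""
    rng = random.Random(seed)
    if len(rows) <= k:
        return list(rows)
    return rng.sample(rows, k)

data = list(range(1000))
s1 = deterministic_sample(data, 10)
s2 = deterministic_sample(data, 10)
# s1 == s2: both runs produce the identical slice
```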

3. Structural analysis

Measures how complete the data is - scanning every column for missing values and producing a completeness score for each one and for the dataset overall.
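The completeness score is essentially the share of non-empty cells, per column and overall. A minimal sketch, assuming records as dicts and treating None and empty strings as missing:

```python
def completeness(rows: list[dict]) -> tuple[dict[str, float], float]:
    """Per-column and overall completeness as fractions in [0, 1]."""
    per_col = {}
    for col in rows[0]:
        present = sum(1 for r in rows if r.get(col) not in (None, ""))
        per_col[col] = present / len(rows)
    overall = sum(per_col.values()) / len(per_col)
    return per_col, overall

rows = [{"a": 1, "b": None}, {"a": 2, "b": 3}]
per_col, overall = completeness(rows)
# "a" is fully populated, "b" is half empty, so the overall score is 0.75
```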

4. Validity checks

Validates that values match their expected type and format (dates look like dates, emails look like emails, numbers fall in sensible ranges). Also optionally flags columns that may contain personal data.
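Format checks of this kind are typically regex- or range-based. A sketch of a per-column validity rate (the patterns shown are deliberately simple examples, not the Curator's real rules):

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # rough email shape
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")            # ISO date shape

def validity_rate(values: list, pattern: re.Pattern) -> float:
    """Fraction of non-null values matching the expected format."""
    checked = [v for v in values if v is not None]
    if not checked:
        return 1.0
    return sum(1 for v in checked if pattern.match(v)) / len(checked)

emails = ["a@x.com", "not-an-email", None]
rate = validity_rate(emails, EMAIL_RE)
# one of the two non-null values matches, so the rate is 0.5
```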

5. Uniqueness analysis

Finds duplicate rows and identifies which columns could act as unique identifiers, based on how many distinct values they contain and whether they follow naming conventions like "id" or "key".
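Both signals - distinct-value ratio and naming convention - can be combined in a few lines. A sketch, with the 0.95 threshold as an illustrative assumption:

```python
def uniqueness_report(rows: list[dict]) -> tuple[int, list[str]]:
    """Count exact duplicate rows and list candidate identifier columns."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes += 1
        seen.add(key)
    candidates = []
    for col in rows[0]:
        vals = [r[col] for r in rows]
        distinct_ratio = len(set(vals)) / len(vals)
        name_hint = col.lower().endswith(("id", "key"))
        if distinct_ratio == 1.0 or (distinct_ratio > 0.95 and name_hint):
            candidates.append(col)
    return dupes, candidates

rows = [
    {"user_id": 1, "city": "Oslo"},
    {"user_id": 2, "city": "Oslo"},
    {"user_id": 3, "city": "Bergen"},
]
dupes, candidates = uniqueness_report(rows)
# no duplicate rows; only "user_id" is fully distinct, so it is the candidate key
```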

6. Distribution analysis

Runs statistical analysis on numeric columns - averages, spread, outlier detection, and distribution shape - to understand whether values behave as expected.
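A sketch of the numeric summary using the standard library, with the common 1.5×IQR rule standing in for whatever outlier method the pipeline actually uses:

```python
import statistics

def distribution_summary(values: list[float]) -> dict:
    """Mean, spread, and IQR-based outliers for one numeric column."""
    q = statistics.quantiles(values, n=4)     # quartiles
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < lo or v > hi],
    }

summary = distribution_summary([10, 11, 12, 11, 10, 12, 11, 100])
# the lone 100 sits far outside the IQR fences and is flagged
```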

7. Column role detection

Classifies each column by its role - identifier, measure, dimension, or time field - and assigns a semantic type where relevant, such as currency, location coordinate, or postal code.
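Role detection is usually heuristic: column names plus value shapes. A deliberately simplified sketch (the real classifier presumably weighs more signals than these):

```python
def detect_role(name: str, values: list) -> str:
    """Classify a column as identifier, time, measure, or dimension."""
    lname = name.lower()
    if lname.endswith(("id", "key")) and len(set(values)) == len(values):
        return "identifier"
    if lname in ("date", "timestamp") or lname.endswith("_at"):
        return "time"
    if all(isinstance(v, (int, float)) for v in values):
        return "measure"
    return "dimension"

role = detect_role("order_id", [1, 2, 3])
# a distinct-valued column named like a key is treated as an identifier
```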

8. Correlation analysis

Identifies relationships between columns using statistical and machine learning methods. Flags pairs that are so strongly linked they may be redundant, and surfaces non-obvious dependencies.

9. Detailed profiling

Goes column by column to document value frequencies, how spread out the data is, and whether values follow recognisable patterns like date formats or reference codes.
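A sketch of a single-column profile covering frequencies, cardinality, and a pattern check (the ISO-date regex is one illustrative pattern among the "recognisable patterns" mentioned above):

```python
from collections import Counter
import re

def profile_column(values: list) -> dict:
    """Value frequencies, distinct count, and share of date-shaped values."""
    freq = Counter(values)
    date_like = sum(
        bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(v))) for v in values
    )
    return {
        "top_values": freq.most_common(3),
        "distinct": len(freq),
        "date_like_share": date_like / len(values),
    }

p = profile_column(["2024-01-01", "2024-01-02", "2024-01-01"])
# two distinct values, all of them date-shaped
```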

10. Missing value patterns

Investigates why data is missing - whether it appears random, related to other columns, or tied to specific categories. This shapes the right approach to handling gaps before modelling or analysis.
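One simple diagnostic is the missing rate per category of another column - a large gap between groups suggests the data is not missing at random. A sketch with hypothetical column names:

```python
def missingness_by_group(rows: list[dict], target: str, group: str) -> dict:
    """Missing rate of `target` per value of `group`."""
    stats: dict = {}
    for r in rows:
        g = r[group]
        total, missing = stats.get(g, (0, 0))
        stats[g] = (total + 1, missing + (r.get(target) is None))
    return {g: missing / total for g, (total, missing) in stats.items()}

rows = [
    {"region": "EU", "income": None},
    {"region": "EU", "income": None},
    {"region": "US", "income": 50000},
    {"region": "US", "income": 60000},
]
rates = missingness_by_group(rows, "income", "region")
# income is always missing for EU rows and never for US rows -
# a strong hint the gaps are tied to region, not random
```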

11. Metadata generation

Compiles all results into a machine-readable metadata file and a human-readable report, including a full data dictionary for every column.
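A sketch of the two outputs from one metadata dict - the field names and values here are invented for illustration, not the Curator's actual schema:

```python
import json

metadata = {
    "dataset": "orders",          # hypothetical dataset name
    "version": 3,
    "quality": {"completeness": 0.97, "validity": 0.99},
    "columns": [
        {"name": "order_id", "type": "int", "role": "identifier"},
    ],
}

# machine-readable file for tooling
machine_readable = json.dumps(metadata, indent=2, sort_keys=True)

# human-readable report with a per-column data dictionary
report_lines = [f"## {metadata['dataset']} (v{metadata['version']})"]
for col in metadata["columns"]:
    report_lines.append(f"- {col['name']} ({col['type']}): {col['role']}")
report = "\n".join(report_lines)
```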

12. LLM cataloguing

Uses a language model to write natural language descriptions of the dataset, suggest relevant tags, and propose potential use cases - making the dataset easier to discover and understand.
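The model call itself is omitted here, but the prompt assembly can be sketched - everything below (wording, counts, function name) is an illustrative assumption:

```python
def build_catalogue_prompt(name: str, columns: list[str], sample_rows: list) -> str:
    """Assemble the cataloguing prompt sent to the language model."""
    return "\n".join([
        f"Dataset: {name}",
        "Columns: " + ", ".join(columns),
        f"Sample rows: {sample_rows}",
        "Write a one-paragraph description, suggest 5 tags, "
        "and propose 3 potential use cases.",
    ])

prompt = build_catalogue_prompt("orders", ["order_id", "amount"], [[1, 9.99]])
```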

13. Storage and archiving

Archives every output - the raw file, sample, metadata, report, and quality scores - to cloud or local storage, organised by owner, dataset, and version.
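The owner/dataset/version layout can be sketched as a simple path scheme (the segment names and `v`-prefix are assumptions about the layout, not its documented form):

```python
from pathlib import PurePosixPath

def artifact_path(owner: str, dataset: str, version: int, artifact: str) -> PurePosixPath:
    """Storage key for one archived artifact: owner/dataset/version/file."""
    return PurePosixPath(owner) / dataset / f"v{version}" / artifact

p = artifact_path("analytics-team", "orders", 3, "metadata.json")
# every artifact from the same run lands under the same versioned prefix
```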

How quality is measured

Every dataset leaves the Curator with an overall quality score built from four dimensions. Each one captures a different aspect of whether data can be trusted.

Completeness

How much of the data is actually present, across every column.

Validity

Whether values match their expected types, formats, and rules.

Uniqueness

The absence of duplicate records that would distort analysis.

Consistency

Whether numeric values are stable, with few outliers and predictable spread.

What gets produced

Every run produces five versioned artifacts, stored alongside each other so you always know exactly what was assessed and when.

Metadata (JSON)
Quality report (Markdown)
Quality scores
Dataset sample
Original file

Why it matters

Most data quality work is done once, by hand, and never written down. The Curator makes that process automatic, repeatable, and versioned. Every time a dataset changes, the pipeline reruns and a fresh set of scored artifacts is saved alongside the previous version.

The result is a living audit trail. You can track quality over time, catch problems before they reach downstream models, and hand any stakeholder a plain-language report without writing a single query.