Dataset Profiler

This is the documentation site for the Dataset Profiler service. The service is part of the wider DataGEMS platform.

The Dataset Profiler service automatically analyzes datasets and extracts metadata from them. It supports multiple data formats, including CSV files, Excel spreadsheets, databases, text documents, and PDF files. The service generates profiles that describe the structure, content, and characteristics of datasets, making them more discoverable and usable. As the project progresses, the service will add profiling for additional data types and formats.

Key Features

Multi-format Support: Profiles CSV files, Excel spreadsheets, databases, text documents, and PDF files
Distributed Processing: Uses Ray for distributed computing to handle large datasets efficiently
API-driven: RESTful API for easy integration with other services
Metadata Extraction: Automatically extracts metadata such as column types, data distributions, and sample values
Standardized Output: Produces standardized JSON-LD profiles that follow the Croissant Metadata Schema
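To make the metadata extraction feature more concrete, the following is a minimal sketch of how column types and sample values could be inferred from a CSV file. It is not the service's actual implementation: the function names `infer_type`, `resolve_type`, and `profile_csv`, the three type labels, and the widening rule are all assumptions made for illustration, and only the Python standard library is used.

```python
import csv
import io
from collections import Counter


def infer_type(value: str) -> str:
    """Classify a single cell as 'integer', 'float', or 'string'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def resolve_type(counts: Counter) -> str:
    """Pick the narrowest type that can hold every observed value."""
    if not counts:
        return "unknown"  # column had no non-empty cells
    if counts["string"]:
        return "string"
    if counts["float"]:
        return "float"
    return "integer"


def profile_csv(text: str, n_samples: int = 3) -> dict:
    """Return {column: {"type": ..., "samples": [...]}} for CSV text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    counts = {name: Counter() for name in fieldnames}
    samples = {name: [] for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row.get(name)
            if value is None or value == "":
                continue  # skip missing cells
            counts[name][infer_type(value)] += 1
            if len(samples[name]) < n_samples:
                samples[name].append(value)
    return {
        name: {"type": resolve_type(counts[name]), "samples": samples[name]}
        for name in fieldnames
    }


if __name__ == "__main__":
    data = "id,price,name\n1,9.5,apple\n2,3,pear\n"
    for column, info in profile_csv(data).items():
        print(column, info)
```

A column that mixes integers and floats is reported as `float`, and any non-numeric cell widens the column to `string`. A production profiler would also need to handle dates, booleans, and locale-specific number formats.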

How It Works

The Dataset Profiler service works by:

Accepting dataset specifications through its API (non-inferred metadata such as name and license)
Analyzing the dataset structure and content
Generating a "light" profile with basic metadata (distribution information)
Optionally generating a "heavy" profile with detailed record set information (file structure, fields, data types)
Returning the profile in a standardized JSON-LD format
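The difference between the "light" and "heavy" profiles can be sketched as two stages that build a JSON-LD document. The example below is an illustration under stated assumptions, not the service's real output: the helper names `build_light_profile` and `build_heavy_profile`, the input dictionaries, and the example URL are hypothetical, and the `@context` is a heavily simplified stand-in for the full Croissant context. The property names (`distribution`, `recordSet`, `field`, `source`) follow the Croissant vocabulary.

```python
import json

# Simplified context for illustration; the full Croissant context defines many more terms.
CONTEXT = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
}


def build_light_profile(spec: dict, files: list) -> dict:
    """Combine user-supplied metadata with one FileObject per file (hypothetical helper)."""
    return {
        "@context": CONTEXT,
        "@type": "sc:Dataset",
        "name": spec["name"],
        "license": spec["license"],
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": f["id"],
                "contentUrl": f["url"],
                "encodingFormat": f["format"],
            }
            for f in files
        ],
    }


def build_heavy_profile(light: dict, columns_by_file: dict) -> dict:
    """Return a copy of the light profile extended with record sets (hypothetical helper)."""
    heavy = dict(light)  # shallow copy, so the light profile gains no new keys
    heavy["recordSet"] = [
        {
            "@type": "cr:RecordSet",
            "@id": f"{file_id}/records",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": f"{file_id}/{column}",
                    "name": column,
                    "dataType": data_type,
                    "source": {
                        "fileObject": {"@id": file_id},
                        "extract": {"column": column},
                    },
                }
                for column, data_type in columns.items()
            ],
        }
        for file_id, columns in columns_by_file.items()
    ]
    return heavy


if __name__ == "__main__":
    spec = {"name": "sales", "license": "CC-BY-4.0"}  # non-inferred metadata
    files = [{"id": "sales.csv", "url": "https://example.org/sales.csv", "format": "text/csv"}]
    light = build_light_profile(spec, files)
    heavy = build_heavy_profile(light, {"sales.csv": {"id": "sc:Integer", "price": "sc:Float"}})
    print(json.dumps(heavy, indent=2))
```

Keeping the heavy profile as an extension of the light one mirrors the optional step described above: the light profile is already a usable description of the dataset, and record set details are added only when requested.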