Service Architecture¶

The Dataset Profiler service is designed with a microservice architecture that enables efficient and scalable dataset profiling. This document outlines the key components and their interactions.

High-Level Architecture¶

The following diagram illustrates the high-level architecture of the Dataset Profiler service:

Architecture Diagram

Component Details¶

API Layer¶

The Backend layer is built using FastAPI and provides the endpoints documented in API Documentation.

Concerning profiling specifically the following endpoints are used:

/profiler/trigger_profile: Submit a new profiling job
/profiler/runner_status/{profile_job_id}: Check the status of the Ray task
/profiler/job_status/{profile_job_id}: Check the status of the profiling job
/profiler/profile/{profile_job_id}: Retrieve the generated profile
/profiler/clean_up: Clean up resources for a completed job

Job Manager¶

The Job Manager is responsible for:

Creating unique job IDs for profiling requests
Submitting jobs to the Ray cluster
Tracking job status (submitting, starting, light profile ready, heavy profiles ready)
Storing and retrieving job results

Profiling Engine¶

The Profiling Engine performs the actual dataset analysis and consists of:

Dataset Specification Parser: Parses and validates input specifications
Distribution Extractor: Identifies and extracts metadata about dataset files
Record Set Extractor: Analyzes the structure and content of datasets
Profile Generator: Assembles the extracted information into standardized profiles

Supported Data Types¶

The service can profile the following types of data:

Tabular Data: CSV files, Excel spreadsheets
Databases: SQL databases with table structures
Documents: Text files, PDF documents
File Collections: Sets of related files

Distributed Computing Layer¶

The service uses Ray for distributed computing, which enables:

Parallel processing of multiple profiling processes
Efficient resource utilization
Fault tolerance and automatic recovery

Profile Job Cache¶

The profile job cache handles:

Job status monitoring
The generated profiles

Both are stored into Redis and are retrievable through the job ID.

Data Flow¶

Client submits a profiling request with dataset specifications
API validates the request and creates a job
Job Manager submits the job to the Ray cluster
Profiling Engine extracts distributions (light profile)
Light profile is stored and made available
If requested, Profiling Engine extracts record sets (heavy profile)
Complete profile is stored and made available to the client