API Overview¶
The Dataset Profiler service provides a RESTful API for submitting profiling jobs, checking job status, and retrieving generated profiles. This document summarizes the API endpoints, the request and response formats, and typical usage patterns.
Authentication¶
The API can be configured to require authentication using JWT tokens. When authentication is enabled (via the ENABLE_AUTH environment variable), all API requests must include a valid JWT token in the Authorization header:
Authorization: Bearer <token>
The token must contain a client_id claim with the value airflow. See the Configuration section for more details on authentication.
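As an illustration of the header and claim requirement above, the following minimal sketch builds the Authorization header and performs a client-side sanity check on the client_id claim. The helper names jwt_claims and auth_headers are hypothetical and not part of the service. The sketch decodes the token payload without verifying the signature, so it only catches obviously wrong tokens early; the service remains the authority on token validity.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("a JWT has three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def auth_headers(token: str) -> dict:
    """Build the Authorization header after checking the client_id claim."""
    if jwt_claims(token).get("client_id") != "airflow":
        raise ValueError("token lacks a client_id claim with the value 'airflow'")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be merged into the headers of any request sent to the profiling endpoints when ENABLE_AUTH is set.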
Base URL¶
The base URL for all profiling endpoints is:
https://{host}/profiler
Where {host} is the hostname where the Dataset Profiler service is deployed.
API Endpoints¶
Submit Profiling Job¶
POST /profiler/trigger_profile
Submits a new dataset profiling job.
Request Body¶
{
  "profile_specification": {
    "id": "8930240b-a0e8-46e7-ace8-aab2b42fcc01",
    "cite_as": "",
    "country": "PT",
    "date_published": "2025-04-23",
    "description": "This dataset was extracted from the MathE platform, an online educational platform developed to support mathematics teaching and learning in higher education. It contains 546 student responses to questions on several mathematical topics. Each record corresponds to an individual answer and includes the following features: Student ID, Student Country, Question ID, Type of Answer (correct or incorrect), Question Level (basic or advanced based on the assessment of the contributing professor), Math Topic (broader mathematical area of the question), Math Subtopic, and Question Keywords. The data spans from February 2019 to December 2023.",
    "fields_of_science": [
      "MATHEMATICS"
    ],
    "headline": "Dataset for Assessing Mathematics Learning in Higher Education.",
    "languages": [
      "en"
    ],
    "keywords": [
      "math",
      "student",
      "higher education"
    ],
    "license": "CC0 1.0",
    "name": "Mathematics Learning Assessment",
    "published_url": "https://dados.ipb.pt//dataset.xhtml?persistentId=doi:10.34620/dadosipb/PW3OWY",
    "uploaded_by": "ADMIN",
    "data_connectors": [
      {
        "type": "RawDataPath",
        "dataset_id": "8930240b-a0e8-46e7-ace8-aab2b42fcc01"
      }
    ]
  },
  "only_light_profile": false
}
Dataset ID Redundancy
The id field of the profile_specification and the dataset_id field of the RawDataPath connector carry the same value.
This duplication is intentional: it simplifies local development and leaves room for connectors to reference a different dataset ID in the future.
Response¶
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "Job submitted"
}
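A submission can be sketched with the Python standard library as follows. The function names build_trigger_request and submit are hypothetical, as are the host value a caller passes in and the Content-Type header, which is assumed here because the request body is JSON. The token parameter is only needed when authentication is enabled.

```python
import json
import urllib.request


def build_trigger_request(host, profile_specification,
                          only_light_profile=False, token=None):
    """Prepare (but do not send) a POST /profiler/trigger_profile request."""
    body = {
        "profile_specification": profile_specification,
        "only_light_profile": only_light_profile,
    }
    # Assumption: the service accepts a JSON body with this content type.
    headers = {"Content-Type": "application/json"}
    if token is not None:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"https://{host}/profiler/trigger_profile",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def submit(request):
    """Send the prepared request and return the job_id from the response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["job_id"]
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.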
Check Job Status¶
GET /profiler/job_status/{profile_job_id}
Checks the detailed status of a profiling job.
Response¶
One of the following status values:
- SUBMITTING
- STARTING
- LIGHT_PROFILE_READY
- HEAVY_PROFILES_READY
- FAILED
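The status values above lend themselves to a simple polling loop. The sketch below is one possible shape, not the service's own client: wait_for_job is a hypothetical helper, and get_status stands for any caller-supplied function that queries the job status endpoint and returns the status as a plain string. It also assumes that a job moves through the statuses in the order listed, so HEAVY_PROFILES_READY implies the light profile is available too.

```python
import time

JOB_STATUSES = (
    "SUBMITTING",
    "STARTING",
    "LIGHT_PROFILE_READY",
    "HEAVY_PROFILES_READY",
    "FAILED",
)


def wait_for_job(get_status, want_heavy=True, interval=5.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the wanted profile is ready or the job fails."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status not in JOB_STATUSES:
            raise ValueError(f"unexpected job status: {status!r}")
        if status == "FAILED":
            raise RuntimeError("profiling job failed")
        if status == "HEAVY_PROFILES_READY":
            return status
        if status == "LIGHT_PROFILE_READY" and not want_heavy:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        sleep(interval)
```

For a job submitted with only_light_profile set to true, calling the helper with want_heavy=False avoids waiting for a heavy profile that is never produced.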
Check Runner Status¶
GET /profiler/runner_status/{profile_job_id}
Checks the status of the Ray task for a given job ID.
Response¶
One of the following status values:
- pending
- in_progress
- completed
- failed
- unknown
Retrieve Profile¶
GET /profiler/profile/{profile_job_id}
Retrieves the generated profile for a completed job.
Response¶
{
  "moma_profile_light": { ... },
  "moma_profile_heavy": { ... },
  "cdd_profile": { "path": "/path/to/<dataset_id>.json" }
}
The cdd_profile object contains the path to the generated CDD profile JSON file.
It is only populated once the heavy profile completes (it remains an empty object
{} for a light-only profile or while the heavy profile is still running).
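Because cdd_profile may be an empty object, client code should not assume the path key exists. A minimal sketch of a defensive accessor, where cdd_profile_path_from is a hypothetical helper name and the argument is the already-parsed profile response:

```python
def cdd_profile_path_from(profile_response: dict):
    """Return the CDD profile path, or None while it is not available."""
    # cdd_profile is {} for light-only jobs or while the heavy profile runs.
    cdd_profile = profile_response.get("cdd_profile") or {}
    return cdd_profile.get("path")
```

A None result means either that the job is light-only or that the heavy profile has not finished yet; the job status endpoint can tell the two cases apart.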
Retrieve CDD Profile Path by Dataset ID¶
GET /profiler/cdd_profile_path/{dataset_id}
Retrieves the path to the CDD profile JSON file for a dataset by its dataset ID (rather than the profiling job ID). This is useful for consumers, such as the Cross-Dataset Discovery service, that only know the dataset ID. The CDD profile file is written once the heavy profile completes.
Response¶
{
  "cdd_profile_path": "/path/to/<dataset_id>.json"
}
If the profile is not ready yet, cdd_profile_path is null.
Clean Up Job Resources¶
POST /profiler/clean_up
Cleans up resources associated with a completed job.
Request Body¶
{
  "profile_job_id": "550e8400-e29b-41d4-a716-446655440000"
}
Response¶
{
  "detail": "SUCCESS"
}
Note
This endpoint is currently a placeholder; the cleanup functionality is not yet
implemented and the call always returns SUCCESS.
Monitoring Endpoints¶
In addition to the profiling endpoints, the service exposes health endpoints under
the /monitoring prefix. See Maintenance for details.
| Endpoint | Description |
|---|---|
| GET / | Pure HTTP liveness endpoint that does not depend on Ray or Redis |
| GET /monitoring/ready | Lightweight readiness probe (no dependency checks) |
| GET /monitoring/health-check | Diagnostic report of Redis and Ray dependency health (always HTTP 200) |
Profile Specification¶
The profile specification object contains metadata about the dataset to be profiled:
| Field | Type | Description |
|---|---|---|
| id | UUID | Unique identifier for the dataset |
| name | string | Name of the dataset |
| description | string | Description of the dataset |
| cite_as | string | Citation information |
| license | string | License information |
| published_url | string | URL where the dataset is published |
| doi | string | Digital Object Identifier for the dataset (optional) |
| headline | string | Short headline describing the dataset |
| keywords | array | List of keywords |
| fields_of_science | array | List of scientific fields |
| languages | array | List of languages used in the dataset |
| country | string | Country code |
| date_published | string | Publication date |
| uploaded_by | string | User who uploaded the dataset |
| data_connectors | array | List of data connectors for accessing the dataset (more info below) |
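Before submitting a job, a client may want to check that a specification carries the fields listed in the table. The sketch below is a client-side convenience only: missing_spec_fields is a hypothetical helper, and it assumes that every field except doi (the only one marked optional above) is expected to be present. The service's own validation rules may differ.

```python
SPEC_FIELDS = (
    "id", "name", "description", "cite_as", "license", "published_url",
    "doi", "headline", "keywords", "fields_of_science", "languages",
    "country", "date_published", "uploaded_by", "data_connectors",
)
# Assumption: only doi may be omitted, since it alone is marked optional.
OPTIONAL_FIELDS = frozenset({"doi"})


def missing_spec_fields(spec: dict) -> list:
    """List the non-optional table fields that are absent from spec."""
    return [
        name for name in SPEC_FIELDS
        if name not in OPTIONAL_FIELDS and name not in spec
    ]
```

An empty list means the specification has every expected key; it says nothing about whether the values themselves are well formed.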
Data Connectors¶
Data connectors specify how to access the raw data for profiling. The following connector types are supported:
- RawDataPath: Specifies a path to the raw data files.
- DatabaseConnection: Specifies connection details for a database.
Example of a list of connectors:
[
  {
    "type": "RawDataPath",
    "dataset_id": "8930240b-a0e8-46e7-ace8-aab2b42fcc01"
  },
  {
    "type": "DatabaseConnection",
    "protocol": "postgresql",
    "engine": "PostgreSQL",
    "database_name": "ds_era5_land",
    "host": "172.16.59.6",
    "port": 5432
  }
]
- The dataset_id field should be the unique identifier of the dataset in the storage system.
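The two connector shapes shown in the example can be checked on the client side with a small helper. In this sketch, check_connector and CONNECTOR_KEYS are hypothetical names, and the key sets are inferred from the example above rather than from a published schema, so treat them as assumptions.

```python
# Keys inferred from the example connector list; not an official schema.
CONNECTOR_KEYS = {
    "RawDataPath": {"dataset_id"},
    "DatabaseConnection": {
        "protocol", "engine", "database_name", "host", "port",
    },
}


def check_connector(connector: dict) -> dict:
    """Return the connector unchanged if its type and keys look plausible."""
    kind = connector.get("type")
    if kind not in CONNECTOR_KEYS:
        raise ValueError(f"unsupported connector type: {kind!r}")
    missing = CONNECTOR_KEYS[kind] - set(connector)
    if missing:
        raise ValueError(f"{kind} connector is missing: {sorted(missing)}")
    return connector
```

Running every entry of data_connectors through such a check before submission surfaces typos in the type name or a forgotten key without a round trip to the service.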
Profile Types¶
The service generates three types of profiles:
- Light Profile: Basic metadata about the dataset and its distributions (files)
- Heavy Profile: Detailed information about the dataset structure, including record sets and field information
- CDD Profile: Profile used by the Cross-Dataset Discovery service. The heavy profiling step writes the CDD profile to a JSON file, and the cdd_profile field of the profile response contains the path to that file ({"path": "..."}). The path can also be looked up by dataset ID via GET /profiler/cdd_profile_path/{dataset_id}.
CDD Profile availability
The CDD profile is only produced as part of the heavy profile. For a light-only
profile (only_light_profile: true) the cdd_profile field stays an empty object.
Typical API Usage Flow¶
- If authentication is enabled, obtain a valid JWT token
- Submit a profiling job using POST /profiler/trigger_profile (with Authorization header if auth is enabled)
- Poll the job status using GET /profiler/job_status/{profile_job_id} (with Authorization header if auth is enabled)
- Once the status is LIGHT_PROFILE_READY or HEAVY_PROFILES_READY, retrieve the profile using GET /profiler/profile/{profile_job_id} (with Authorization header if auth is enabled)
- Optionally, clean up resources using POST /profiler/clean_up (with Authorization header if auth is enabled)
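The flow above can be sketched end to end as a single function. Everything outside the endpoint paths and payload keys is an assumption: profile_dataset is a hypothetical helper, and call stands for a caller-supplied transport function call(method, path, body) that sends the request (adding the Authorization header when auth is enabled) and returns the parsed response. The sketch further assumes that the job status endpoint yields the status as a plain string.

```python
import time


def profile_dataset(call, profile_specification, only_light_profile=False,
                    interval=5.0, max_polls=120, sleep=time.sleep):
    """Submit a job, wait for the wanted profile, fetch it, then clean up."""
    job = call("POST", "/profiler/trigger_profile", {
        "profile_specification": profile_specification,
        "only_light_profile": only_light_profile,
    })
    job_id = job["job_id"]

    # A light-only job is done at LIGHT_PROFILE_READY; otherwise wait for heavy.
    target = "LIGHT_PROFILE_READY" if only_light_profile else "HEAVY_PROFILES_READY"
    for _ in range(max_polls):
        status = call("GET", f"/profiler/job_status/{job_id}", None)
        if status == "FAILED":
            raise RuntimeError(f"profiling job {job_id} failed")
        if status == target:
            break
        sleep(interval)
    else:
        raise TimeoutError(f"job {job_id} not ready after {max_polls} polls")

    profile = call("GET", f"/profiler/profile/{job_id}", None)
    # Optional step; the endpoint is currently a placeholder.
    call("POST", "/profiler/clean_up", {"profile_job_id": job_id})
    return profile
```

Injecting the transport keeps the sequencing logic independent of any particular HTTP library and makes it straightforward to exercise with a stub in tests.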