Configuration¶
This document describes how to configure the Dataset Profiler service for different environments and use cases.
Configuration Files¶
The Dataset Profiler service uses YAML configuration files located in the dataset_profiler/configs/ directory:
config_dev.yml: Development environment configurationconfig_prod.yml: Production environment configuration
The configuration file is selected via the --config_file command-line argument. When it is not provided, the service falls back to config_prod.yml. The production Docker image (Dockerfile.api) launches uvicorn pointing at config_prod.yml, while for local development you pass --config_file dataset_profiler/configs/config_dev.yml.
Configuration Structure¶
The configuration files have the following structure:
mode: "dev" # or "prod"
fastapi:
workers: 1
debug: True
reload: True
host: '0.0.0.0'
port: 4557
logging:
level: "INFO" # DEBUG, INFO, WARNING, ERROR, CRITICAL
app_name: "dataset-profiler"
enable_json_format: false
file_log_path: "./dataset-profiler-dev.log"
Note that the production configuration file (config_prod.yml) does not have fastapi configuration since the deployed service has these parameters in the uvicorn command line found in Dockerfile.api:
mode: "prod"
logging:
level: "INFO" # DEBUG, INFO, WARNING, ERROR, CRITICAL
app_name: "dataset-profiler"
enable_json_format: true
file_log_path: "./logs/dataset-profiler-dev.log"
Configuration Options¶
FastAPI Configuration¶
| Option | Description | Default |
|---|---|---|
| host | Host to bind the API server | 0.0.0.0 |
| port | Port to bind the API server | 4557 |
| debug | Enable debug mode | True (dev) |
| workers | Number of worker processes | 1 |
| reload | Enable auto-reload on code changes | True (dev) |
Logging Configuration¶
| Option | Description | Default |
|---|---|---|
| level | Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) | INFO |
| app_name | Application name for logging | dataset-profiler |
| enable_json_format | Enable JSON formatted logs | false (dev), true (prod) |
| file_log_path | Path for log file | ./dataset-profiler-dev.log (dev), ./logs/dataset-profiler-dev.log (prod) |
Environment Variables¶
Environment variables can be used to override configuration values. The following environment variables are supported based on the .env.example file:
| Variable | Description | Default (in code) |
|---|---|---|
| DATAGEMS_POSTGRES_HOST | PostgreSQL database host | (unset) |
| DATAGEMS_POSTGRES_PORT | PostgreSQL database port | (unset) |
| DATAGEMS_POSTGRES_USERNAME | PostgreSQL database username | (unset) |
| DATAGEMS_POSTGRES_PASSWORD | PostgreSQL database password | (unset) |
| DATAGEMS_TIMESCALE_DB_HOST | TimescaleDB database host | (unset) |
| DATAGEMS_TIMESCALE_DB_PORT | TimescaleDB database port | (unset) |
| DATAGEMS_TIMESCALE_DB_USERNAME | TimescaleDB database username | (unset) |
| DATAGEMS_TIMESCALE_DB_PASSWORD | TimescaleDB database password | (unset) |
| DATA_ROOT_PATH | Root path prepended to distribution file paths (mount to S3) | '' |
| MOUNT_POINT | Prefix prepended to RawDataPath connector dataset IDs when resolving raw data on the Ray worker |
'' |
| CDD_PROFILE_PATH | Directory where generated CDD profile JSON files are written | '' |
| RAY_ADDRESS | Address of the Ray head node | ray://ray-head:10001 |
| REDIS_HOST | Redis host | redis |
| REDIS_PORT | Redis port | 6379 |
| REDIS_DB | Redis database number | 0 |
| BASE_URL | Root path prefix the API is served under (FastAPI root_path) |
'' |
| ENABLE_AUTH | Enable API authentication | false |
Note
The defaults above are the fallbacks hard-coded in the application. The
.env.example file ships example values that may differ (e.g. it uses
localhost-based hosts suitable for running the dependencies locally).
Authentication¶
The Dataset Profiler service supports token-based authentication for all API endpoints. Authentication can be enabled or disabled using the ENABLE_AUTH environment variable.
When authentication is enabled, all API requests must include a valid JWT token in the Authorization header using the Bearer scheme:
Authorization: Bearer <token>
The token is expected to be a JWT with the following claims:
- client_id: Must be set to 'airflow' for authentication to succeed
Example token payload:
{
"exp": 1762422572,
"iat": 1762422272,
"jti": "trrtcc:208939ba-d97e-4271-af20-d7c3310456b2",
"iss": "https://datagems-dev.scayle.es/oauth/realms/dev",
"aud": "account",
"sub": "a35281e0-641e-41d9-b3ee-b94bf11d866f",
"typ": "Bearer",
"azp": "airflow",
"acr": "1",
"allowed-origins": [
"http://localhost:8088"
],
"realm_access": {
"roles": [
"default-roles-dev",
"offline_access",
"uma_authorization"
]
},
"resource_access": {
"account": {
"roles": [
"manage-account",
"manage-account-links",
"view-profile"
]
}
},
"scope": "profile email",
"clientHost": "172.16.59.4",
"email_verified": false,
"preferred_username": "service-account-airflow",
"clientAddress": "172.16.59.4",
"client_id": "airflow"
}
When authentication is disabled (ENABLE_AUTH=false), API endpoints can be accessed without providing a token.