Configuration¶
This document describes how to configure the Dataset Profiler service for different environments and use cases.
Configuration Files¶
The Dataset Profiler service uses YAML configuration files located in the dataset_profiler/configs/ directory:
config_dev.yml: Development environment configurationconfig_prod.yml: Production environment configuration
The appropriate configuration file is loaded based on the ENVIRONMENT environment variable, which can be set to dev or prod.
Configuration Structure¶
The configuration files have the following structure:
mode: "dev" # or "prod"
fastapi:
workers: 1
debug: True
reload: True
host: '0.0.0.0'
port: 4557
logging:
level: "INFO" # DEBUG, INFO, WARNING, ERROR, CRITICAL
app_name: "dataset-profiler"
enable_json_format: false
file_log_path: "./dataset-profiler-dev.log"
Note that the production configuration file (config_prod.yml) does not have fastapi configuration since the deployed service has these parameters in the uvicorn command line found in Dockerfile.api:
mode: "prod"
logging:
level: "INFO" # DEBUG, INFO, WARNING, ERROR, CRITICAL
app_name: "dataset-profiler"
enable_json_format: true
file_log_path: "./logs/dataset-profiler-dev.log"
Configuration Options¶
FastAPI Configuration¶
| Option | Description | Default |
|---|---|---|
| host | Host to bind the API server | 0.0.0.0 |
| port | Port to bind the API server | 4557 |
| debug | Enable debug mode | True (dev) |
| workers | Number of worker processes | 1 |
| reload | Enable auto-reload on code changes | True (dev) |
Logging Configuration¶
| Option | Description | Default |
|---|---|---|
| level | Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) | INFO |
| app_name | Application name for logging | dataset-profiler |
| enable_json_format | Enable JSON formatted logs | false (dev), true (prod) |
| file_log_path | Path for log file | ./dataset-profiler-dev.log (dev), ./logs/dataset-profiler-dev.log (prod) |
Environment Variables¶
Environment variables can be used to override configuration values. The following environment variables are supported based on the .env.example file:
| Variable | Description | Default |
|---|---|---|
| DATAGEMS_POSTGRES_HOST | PostgreSQL database host | host_name |
| DATAGEMS_POSTGRES_PORT | PostgreSQL database port | 5555 |
| DATAGEMS_POSTGRES_USERNAME | PostgreSQL database username | username |
| DATAGEMS_POSTGRES_PASSWORD | PostgreSQL database password | password |
| RAW_DATA_ROOT_PATH | Root path for raw data storage (mount to s3) | '' |
| RAY_ADDRESS | Address of the Ray head node | ray://localhost:10001 |
| REDIS_HOST | Redis host | localhost |
| REDIS_PORT | Redis port | 6379 |
| REDIS_DB | Redis database number | 0 |