Deployment¶
The Cross-Dataset Discovery service is designed for containerized deployment using Docker.
Docker¶
The repository includes a multi-stage Dockerfile that builds the service image.
Dockerfile Stages¶
- Builder Stage:
  - Starts from python:3.11-slim.
  - Installs Python dependencies.
  - Pre-downloads the Qdrant/bm25 model for FastEmbed to ensure fast startup.
- Final Stage:
  - Copies the application code and the pre-downloaded model cache.
  - Runs as a non-root user (appuser).
  - Exposes port 8000.
Build and Run¶
To build and run the container:
# 1. Build the Docker image
docker build -t cross-dataset-discovery .
# 2. Run the container
# Note: Requires network access to Qdrant and TEI services.
docker run -p 8000:8000 \
  -e DB_CONNECTION_STRING="postgresql://..." \
  -e OIDC_ISSUER_URL="https://..." \
  -e QDRANT_URL="http://host.docker.internal:6333" \
  -e TEI_URL="http://host.docker.internal:8080" \
  --name cdd-service \
  cross-dataset-discovery
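The run command above passes four settings with -e, so a small pre-flight check can catch an unset variable before the container starts. The following is a minimal sketch, not part of the repository: the function name missing_vars is hypothetical, and the variable list is taken only from the example command, so the service may read additional settings.

```shell
# Hypothetical helper (not from the repository): print the name of every
# listed environment variable that is unset or empty, one per line.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup: copy the value of the variable named by $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Example use before `docker run` (assumes the four variables shown above):
#   missing="$(missing_vars DB_CONNECTION_STRING OIDC_ISSUER_URL QDRANT_URL TEI_URL)"
#   [ -z "$missing" ] || { echo "Missing settings: $missing" >&2; exit 1; }
```

The function only reports names and never exits on its own, so the calling script decides whether a missing setting is fatal.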
Dependencies¶
The service requires several external systems to function.
Runtime Dependencies¶
- Qdrant Vector Database:
  - Requirement: A running Qdrant instance containing the datagems collection.
  - Config: QDRANT_URL, QDRANT_COLLECTION.
- Text Embeddings Inference (TEI):
  - Requirement: A running TEI service serving the BAAI/bge-m3 model.
  - Config: TEI_URL.
- OIDC Provider:
  - Requirement: Keycloak (or similar) for JWT validation.
- DataGEMS Gateway:
  - Requirement: API access for fetching user permissions.
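Since the service depends on these systems being reachable, it can help to probe them before starting the container. The sketch below is an illustration under stated assumptions, not the project's own tooling: the helper names are hypothetical, it assumes curl is installed, it relies on Qdrant's REST endpoint GET /collections/{name}, TEI's GET /health endpoint, and the standard OIDC discovery path, and it assumes none of them require credentials for the probe. The DataGEMS Gateway is left out because its API is not described here.

```shell
# Hypothetical helpers (not from the repository) for probing dependencies.

# Strip one trailing slash so joined URLs never contain "//".
trim_slash() {
  printf '%s' "${1%/}"
}

# Build the URL that describes a single Qdrant collection.
qdrant_collection_url() {
  printf '%s/collections/%s' "$(trim_slash "$1")" "$2"
}

# Build the TEI health-check URL.
tei_health_url() {
  printf '%s/health' "$(trim_slash "$1")"
}

# Build the OIDC discovery document URL for an issuer.
oidc_discovery_url() {
  printf '%s/.well-known/openid-configuration' "$(trim_slash "$1")"
}

# Return 0 if the URL answers within 5 seconds without an HTTP error
# (curl --fail treats status codes of 400 and above as failures).
probe() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Example use (assumes the variables from the Config entries above are set):
#   probe "$(qdrant_collection_url "$QDRANT_URL" "${QDRANT_COLLECTION:-datagems}")" \
#     || echo "Qdrant collection not reachable" >&2
#   probe "$(tei_health_url "$TEI_URL")" || echo "TEI not reachable" >&2
#   probe "$(oidc_discovery_url "$OIDC_ISSUER_URL")" || echo "OIDC issuer not reachable" >&2
```

When run from the host rather than from inside the container, host.docker.internal addresses may need to be replaced with localhost.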
Build-time Dependencies¶
- Python 3.11
- FastEmbed Model: The Qdrant/bm25 model is downloaded during the Docker build to enable local sparse vector generation without runtime downloads.