Architecture Overview

This page presents the high-level architecture of the dataset recommender system.

Dataset RecSys Architecture

%%{init: {'flowchart': {'nodeSpacing': 22, 'rankSpacing': 22, 'curve': 'linear'}, 'themeVariables': {'fontSize': '12px', 'lineColor': '#9E9E9E', 'edgeLabelBackground':'#ffffff', 'primaryBorderColor':'#BDBDBD', 'clusterBorder':'#E0E0E0', 'lineWidth':'0.9px'}}}%%
flowchart TB


    %% Representation Pipeline
    subgraph Representation[ ]
        A[Data Ingestion] --> B[Preprocessing]
        B --> E[Optional LLM Enrichment]
        E --> C[Embedding Model]
        C --> D[Vector DB]
    end


    %% Retrieval
    subgraph Retrieval[ ]
        D --> F[Candidate Retrieval]
        F --> G[Optional Reranking Module]
    end


    %% Serving
    subgraph Serving[ ]
        G --> I[Redis Storage]
        I --> J[API Layer]
        J --> K[Client / Platform]
    end

    %% Styling
    classDef update fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px,color:#0D47A1;
    classDef retrieval fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px,color:#E65100;
    classDef serving fill:#F3E5F5,stroke:#8E24AA,stroke-width:2px,color:#4A148C;
    classDef header fill:#FFFFFF,stroke:#BDBDBD,stroke-width:1.5px,color:#333333;

    class A,B,C,D,E update;
    class F,G retrieval;
    class I,J,K serving;

Main Components

1. Data Ingestion

The pipeline starts from dataset profiles fetched from the DataGEMS API. These profiles include the metadata fields used to represent each dataset.

2. Metadata Preprocessing

Before embedding, the selected metadata fields are cleaned, normalized, and combined into a single textual representation for each dataset.
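
The preprocessing step could look roughly like the following. This is a minimal sketch, not the project's actual code: the field names in `FIELDS`, the `clean` and `build_representation` helpers, and the `field: value` output layout are all assumptions made for illustration.

```python
import html
import re
import unicodedata

# Hypothetical field selection; the real profile schema is not specified here.
FIELDS = ("name", "description", "keywords")


def clean(value):
    """Normalize one metadata value into a single-line plain string."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        value = ", ".join(str(v) for v in value)
    text = unicodedata.normalize("NFKC", str(value))  # unify Unicode forms
    text = html.unescape(text)                        # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)              # drop simple HTML tags
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace


def build_representation(profile, fields=FIELDS):
    """Combine the selected fields into the text that is later embedded."""
    parts = []
    for field in fields:
        text = clean(profile.get(field))
        if text:  # skip missing or empty fields
            parts.append(f"{field}: {text}")
    return "\n".join(parts)
```

Keeping the field label in front of each value is one possible design choice; it gives the embedding model a hint about what each part of the text describes.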

3. Optional LLM Enrichment

An optional enrichment step can transform raw dataset metadata into a richer semantic representation before embedding.

4. Embedding Model

The embedding model converts each dataset representation into a dense vector that captures semantic similarity across datasets.

5. Vector Database

The generated dataset embeddings are stored in a vector database and reused for recommendation, so the system does not need to recompute them for every request.

6. Candidate Retrieval

The system retrieves candidate datasets using similarity-based search over the embedding collection.
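
To make the idea concrete, here is a small sketch of similarity-based retrieval. It assumes cosine similarity and an exhaustive scan over an in-memory dictionary of embeddings. A vector database would normally run this search itself, often with an approximate nearest-neighbor index, so the function names and data layout below are illustrative only.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_candidates(query_id, embeddings, n=10):
    """Return the n datasets most similar to query_id as (dataset_id, score) pairs."""
    query = embeddings[query_id]
    scored = (
        (other_id, cosine(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id  # a dataset is never its own candidate
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])
```

For example, `retrieve_candidates("a", {"a": [1, 0], "b": [0.9, 0.1], "c": [0, 1]}, n=2)` ranks `"b"` ahead of `"c"`, since `"b"` points in nearly the same direction as `"a"`.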

7. Reranking Module

An optional reranking stage can refine the retrieved candidates before the final top-k recommendation list is produced.
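
One way to keep the reranking stage optional is to treat it as a pluggable function that is skipped when not configured. The sketch below assumes candidates are `(dataset_id, score)` pairs. The `make_publisher_boost` reranker and its `publishers` mapping are purely hypothetical stand-ins for whatever reranking signal the system actually uses.

```python
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Tuple[str, float]  # (dataset_id, score)
Reranker = Callable[[str, List[Candidate]], List[Candidate]]


def top_k(query_id: str, candidates: List[Candidate], k: int = 5,
          reranker: Optional[Reranker] = None) -> List[Candidate]:
    """Apply the reranker if one is configured, then keep the k best candidates."""
    if reranker is not None:
        candidates = reranker(query_id, candidates)
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[:k]


def make_publisher_boost(publishers: Dict[str, str], boost: float = 0.1) -> Reranker:
    """Hypothetical reranker: raise the score of candidates that share the query's publisher."""
    def rerank(query_id: str, candidates: List[Candidate]) -> List[Candidate]:
        target = publishers.get(query_id)
        return [
            (cid, score + boost)
            if target is not None and publishers.get(cid) == target
            else (cid, score)
            for cid, score in candidates
        ]
    return rerank
```

With `reranker=None` the retrieval order is kept unchanged and only the cut to `k` items is applied, so the same code path serves both configurations.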

8. Redis Storage

Redis stores the final top-k recommendation list for each dataset together with the associated ranking scores, so results can be served efficiently at request time.
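
As an illustration of how a precomputed list might be written, the sketch below serializes the ranked list as a JSON string under one key per dataset. The key scheme `recs:<dataset_id>` and the JSON layout are assumptions, not the system's documented format; a Redis sorted set would be another reasonable choice. The `client` only needs a `set(name, value)` method, which `redis.Redis` from redis-py provides; `InMemoryStore` is a stand-in so the example runs without a server.

```python
import json


class InMemoryStore:
    """Stand-in for a Redis client, exposing only set() and get()."""

    def __init__(self):
        self._data = {}

    def set(self, name, value):
        self._data[name] = value

    def get(self, name):
        return self._data.get(name)


def recs_key(dataset_id):
    # Hypothetical key scheme: one key per source dataset.
    return f"recs:{dataset_id}"


def store_recommendations(client, dataset_id, ranked):
    """Write the top-k list for one dataset, preserving rank order and scores."""
    payload = json.dumps(
        [{"dataset_id": rec_id, "score": round(score, 6)} for rec_id, score in ranked]
    )
    client.set(recs_key(dataset_id), payload)
```

A call such as `store_recommendations(InMemoryStore(), "ds-1", [("ds-7", 0.91), ("ds-3", 0.84)])` stores a two-item list; because the list is written already sorted, request-time code only has to read and return it.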

9. API Layer

The API exposes the stored recommendations to downstream consumers, including the DataGEMS platform or other services.
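
A framework-agnostic sketch of a read handler is shown below. It assumes each recommendation list is stored as a JSON array under a hypothetical key `recs:<dataset_id>`, and it returns a `(status_code, body)` pair that a web framework could turn into an HTTP response. The function name, the `limit` parameter, and the response shape are illustrative, not the actual API contract.

```python
import json


def get_recommendations(client, dataset_id, limit=None):
    """Look up stored recommendations; client needs a get(name) method."""
    raw = client.get(f"recs:{dataset_id}")  # hypothetical key scheme
    if raw is None:
        return 404, {"error": f"no recommendations stored for {dataset_id}"}
    items = json.loads(raw)  # accepts str or bytes
    if limit is not None:
        items = items[:limit]  # list is assumed to be stored in rank order
    return 200, {"dataset_id": dataset_id, "recommendations": items}
```

Because `json.loads` accepts both `str` and `bytes`, the handler works whether or not the redis-py client is created with `decode_responses=True`.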