System Architecture
The In-Dataset Discovery Service is a self-contained application designed to act as a specialized query layer within the DataGEMS ecosystem.
Core Components
- FastAPI Application: The heart of the service is a Python web application built with FastAPI. It exposes the REST API, handles incoming HTTP requests, and orchestrates all internal operations.
- Geospatial Component: A module that processes natural language geospatial queries, identifies places using Wikidata, generates OverpassQL queries, and retrieves geospatial data from OpenStreetMap via the Overpass API.
- Text-to-SQL Component: An intelligent module that converts natural language questions into SQL queries using LLM capabilities. It handles database schema understanding, query generation, and execution.
- Security Layer: This layer intercepts all incoming requests to perform authentication and authorization. It integrates with an external OIDC provider to validate JWTs and with the DataGEMS Gateway to fetch user-specific permissions.
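
How these components fit together can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the service's actual code: `Request`, `security_layer`, `dispatch`, the `dataset_reader` role, and the `FAKE_PERMISSIONS` table are all hypothetical, and the table stands in for both JWT validation against the OIDC provider and the permission lookup against the DataGEMS Gateway. The real service relies on FastAPI for routing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class Request:
    """Hypothetical stand-in for an incoming HTTP request."""
    method: str
    path: str
    token: Optional[str] = None
    body: dict = field(default_factory=dict)


# Assumed data: replaces the OIDC provider and the DataGEMS Gateway.
# Maps a bearer token to the roles granted to its user.
FAKE_PERMISSIONS: Dict[str, Set[str]] = {"valid-token": {"dataset_reader"}}


def security_layer(request: Request, required_role: str) -> int:
    """Return 200 if allowed, 401 for an unknown token, 403 for a missing role."""
    roles = FAKE_PERMISSIONS.get(request.token or "")
    if roles is None:
        return 401
    if required_role not in roles:
        return 403
    return 200


def geospatial_component(request: Request) -> dict:
    # Placeholder for the Wikidata / Overpass pipeline.
    return {"component": "geospatial", "question": request.body.get("question")}


def text2sql_component(request: Request) -> dict:
    # Placeholder for the LLM-based SQL generation and execution.
    return {"component": "text2sql", "question": request.body.get("question")}


ROUTES: Dict[Tuple[str, str], Callable[[Request], dict]] = {
    ("GET", "/geospatial"): geospatial_component,
    ("POST", "/text2sql"): text2sql_component,
}


def dispatch(request: Request, required_role: str = "dataset_reader") -> dict:
    """Run the security check first, then hand off to the matching component."""
    status = security_layer(request, required_role)
    if status != 200:
        return {"status": status}
    handler = ROUTES.get((request.method, request.path))
    if handler is None:
        return {"status": 404}
    return {"status": 200, "result": handler(request)}
```

Placing the security check ahead of route lookup mirrors the description of a layer that intercepts every request before any component runs.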
Request Flow
Geospatial Query Flow
- A user sends a `GET /geospatial` request with a natural language question about a location.
- The service uses Wikidata to identify the most relevant place entity.
- An OverpassQL query is generated based on the Wikidata information.
- The Overpass API is queried to retrieve geospatial data.
- Results are formatted and returned, including coordinates, GeoJSON data, and bounding boxes.
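
The steps above can be sketched with a few pure helper functions. This is an illustrative sketch, not the service's implementation: the function names are hypothetical, and it assumes the place is looked up through Wikidata's `wbsearchentities` action and matched in OpenStreetMap through the `wikidata` tag, which is one common approach. No network calls are made here; the helpers only build the request URL and the query text, and compute a bounding box from coordinates.

```python
from typing import Iterable, Tuple
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(place_name: str, language: str = "en") -> str:
    """Build a wbsearchentities request URL for a place name."""
    params = {
        "action": "wbsearchentities",
        "search": place_name,
        "language": language,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def build_overpass_query(wikidata_id: str, timeout_s: int = 25) -> str:
    """Select OSM nodes, ways, and relations tagged with the given Wikidata ID."""
    if not (wikidata_id.startswith("Q") and wikidata_id[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {wikidata_id!r}")
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        f'nwr["wikidata"="{wikidata_id}"];\n'
        "out geom;"
    )


def bounding_box(points: Iterable[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Return (min_lat, min_lon, max_lat, max_lon) for (lat, lon) pairs."""
    lats, lons = zip(*points)
    return (min(lats), min(lons), max(lats), max(lons))
```

For example, `build_overpass_query("Q90")` (Q90 is the Wikidata item for Paris) yields a three-line query that could be posted to an Overpass API endpoint. Validating the ID before interpolating it keeps arbitrary text out of the query string.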
Text-to-SQL Query Flow
- A user sends a `POST /text2sql` request with a natural language question and database connection information.
- The Security Layer validates the token and checks for the required user roles (if authentication is enabled).
- The Text-to-SQL Component uses an LLM to analyze the question and database schema.
- A SQL query pattern is generated and parameterized with the provided coordinates or other parameters.
- The SQL query is executed against the specified database.
- The response contains both the generated SQL and the results of executing it.
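
A minimal sketch of this flow, using an in-memory SQLite database so it runs without external services. Everything here is an assumption made for illustration: the source does not name the database engine, `generate_sql_pattern` is a hypothetical stub that returns a fixed pattern where the real component would call an LLM, and the `stations` table is invented test data.

```python
import sqlite3
from typing import Any, Dict


def describe_schema(conn: sqlite3.Connection) -> str:
    """Collect CREATE TABLE statements so they can be shown to the LLM."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def generate_sql_pattern(question: str, schema: str) -> str:
    """Hypothetical stand-in for the LLM call; returns a fixed SQL pattern."""
    return (
        "SELECT name FROM stations "
        "WHERE lat BETWEEN :min_lat AND :max_lat "
        "AND lon BETWEEN :min_lon AND :max_lon "
        "ORDER BY name"
    )


def run_text2sql(
    conn: sqlite3.Connection, question: str, params: Dict[str, Any]
) -> Dict[str, Any]:
    """Generate a SQL pattern, bind the parameters, and execute it."""
    schema = describe_schema(conn)
    sql = generate_sql_pattern(question, schema)
    # Values are bound by the driver, never interpolated into the SQL text.
    rows = conn.execute(sql, params).fetchall()
    return {"sql": sql, "results": rows}
```

Keeping the generated SQL as a pattern with named placeholders means coordinates and other user-supplied values reach the database only as bound parameters, which avoids SQL injection through those values.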