# Developer Onboarding Guide

Welcome to the developer guide for the In-Dataset Discovery Service. This document provides an overview of the service's architecture and step-by-step instructions for extending it with new features, agents, and tools.
## Table of Contents

- Service Overview
  - Core Principles
  - Directory Structure
  - Key Components
- System Architecture
  - API Layer
  - Agent Layer
  - Tool Layer
- Adding New Features
  - Adding a New API Endpoint
  - Adding a New Agent
  - Adding a New Tool
- Development Workflow
  - Local Setup
  - Testing
  - Code Style
## 1. Service Overview
The In-Dataset Discovery Service is a FastAPI-based application that provides natural language interfaces for exploring datasets. It uses LLM capabilities to convert user questions into executable queries and agents to orchestrate complex multi-step tasks.
### Core Principles
The service is built on a few key principles:
- Modularity: Each component (agents, tools, models) is self-contained and can be extended independently.
- LLM-Powered: Leverages LLM and LangChain for intelligent query generation and processing.
- Agent-Based: Uses LangGraph to orchestrate complex workflows that combine multiple tools.
- Type Safety: Uses Pydantic models for request/response validation and type safety.
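The real workflows in `src/agents/graph.py` are built with LangGraph, but the agent-based idea can be illustrated without it. Below is a minimal, dependency-free sketch of a two-node workflow in which each node reads and extends a shared state; the node names and state keys are hypothetical, not taken from the service:

```python
# Dependency-free sketch of a two-node workflow; node names are hypothetical.
from typing import Any, Dict

State = Dict[str, Any]


def parse_question(state: State) -> State:
    # First node: normalize the raw question into a query.
    return {**state, "query": state["question"].strip().lower()}


def run_query(state: State) -> State:
    # Second node: stand-in for executing the query and attaching an answer.
    return {**state, "answer": f"results for: {state['query']}"}


def run_workflow(question: str) -> State:
    state: State = {"question": question}
    for node in (parse_question, run_query):  # fixed edges between nodes
        state = node(state)
    return state
```

In the actual service, LangGraph provides this kind of sequencing plus conditional edges and shared state management.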
### Directory Structure

```
src/
├── app/                    # FastAPI application
│   ├── main.py             # API endpoints and routing
│   ├── config.py           # Configuration management
│   ├── security.py         # Authentication and authorization
│   ├── exceptions.py       # Custom exception classes
│   └── logging_config.py   # Logging setup
├── agents/                 # Agent implementations
│   ├── geospatial.py       # Geospatial query agent
│   ├── text2sql.py         # Text-to-SQL agent
│   ├── graph.py            # LangGraph workflow definitions
│   └── tools/              # Agent tools
│       ├── geospatial_tools.py
│       └── text2sql_tool.py
└── models/                 # Pydantic models
    └── geospatial_models.py
```
### Key Components

- **FastAPI Application** (`src/app/main.py`): Defines all API endpoints and request/response models.
- **Agents** (`src/agents/`): Implement the core logic for processing queries:
  - `geospatial.py`: Handles geospatial queries using the Wikidata and Overpass APIs
  - `text2sql.py`: Converts natural language to SQL queries
- **Tools** (`src/agents/tools/`): Reusable functions that agents can call:
  - `geospatial_tools.py`: Tools for geospatial operations
  - `text2sql_tool.py`: Tools for SQL generation and execution
- **Models** (`src/models/`): Pydantic models for data validation
## 2. System Architecture

### API Layer

The API layer (`src/app/main.py`) handles HTTP requests and responses. Each endpoint:
- Validates input using Pydantic models
- Handles authentication (if required)
- Calls the appropriate agent or tool
- Returns structured responses
### Agent Layer
Agents are the core processing units that handle specific types of queries:
- Geospatial Agent: Processes location-based queries
- Text-to-SQL Agent: Converts questions to SQL queries
### Tool Layer

Tools are reusable functions that agents can invoke. They encapsulate specific operations such as:

- Querying external APIs (Overpass, Wikidata)
- Generating SQL queries
- Executing database queries
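For instance, a geospatial tool typically has to assemble an Overpass QL query before calling the API. The following is a hedged sketch, not the actual `geospatial_tools.py` implementation; the function name and tag filter are illustrative:

```python
from typing import Dict


def build_overpass_query(bbox: Dict[str, float], amenity: str, timeout: int = 25) -> str:
    """Build an Overpass QL query for amenities inside a bounding box."""
    # Overpass expects south,west,north,east ordering in the bbox filter.
    bbox_str = f"{bbox['south']},{bbox['west']},{bbox['north']},{bbox['east']}"
    return (
        f"[out:json][timeout:{timeout}];"
        f'node["amenity"="{amenity}"]({bbox_str});'
        "out body;"
    )
```

Keeping query construction separate from the HTTP call makes tools like this easy to test without any network access.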
## 3. Adding New Features

### Adding a New API Endpoint
1. **Define Request/Response Models** in `src/app/main.py`:

   ```python
   from typing import Any, Dict

   from pydantic import BaseModel, Field


   class MyQueryRequest(BaseModel):
       question: str = Field(..., description="The question to process")


   class MyQueryResponse(BaseModel):
       answer: str
       metadata: Dict[str, Any]
   ```
2. **Create the Endpoint**:

   ```python
   @app.post("/my-endpoint", response_model=MyQueryResponse)
   async def my_endpoint(query: MyQueryRequest):
       """Process a query and return results."""
       # Your logic here
       result = process_query(query.question)
       return MyQueryResponse(answer=result, metadata={})
   ```
3. **Add Error Handling**:

   ```python
   try:
       result = process_query(query.question)
       return MyQueryResponse(answer=result, metadata={})
   except Exception as e:
       logger.error("query_failed", error=str(e))
       raise HTTPException(status_code=500, detail=str(e))
   ```
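A blanket 500 hides useful signal; one common refinement is to map domain exceptions (such as those defined in `src/app/exceptions.py`) to specific status codes. The exception names below are hypothetical, not the service's actual classes:

```python
from typing import Dict, Type


class QueryValidationError(Exception):
    """Hypothetical domain error: the question could not be parsed."""


class UpstreamAPIError(Exception):
    """Hypothetical domain error: an external API call failed."""


# Exact-type lookup keeps the example simple; anything unmapped becomes a 500.
STATUS_BY_EXCEPTION: Dict[Type[Exception], int] = {
    QueryValidationError: 422,
    UpstreamAPIError: 502,
}


def status_for(exc: Exception) -> int:
    """Return the HTTP status code to report for a caught exception."""
    return STATUS_BY_EXCEPTION.get(type(exc), 500)
```

The endpoint's except block can then call `status_for(e)` instead of hard-coding 500.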
### Adding a New Agent

1. **Create the agent file** in `src/agents/`:

   ```python
   # src/agents/my_agent.py
   from typing import Any, Dict

   import structlog

   logger = structlog.get_logger(__name__)


   def my_agent(question: str, log: structlog.BoundLogger) -> Dict[str, Any]:
       """Process a question using my_agent."""
       log.info("my_agent_started", question=question)
       # Your agent logic here
       result = process_question(question)
       log.info("my_agent_completed", result=result)
       return result
   ```
2. **Add tools** (if needed) in `src/agents/tools/`:

   ```python
   # src/agents/tools/my_tool.py
   from typing import Any, Dict


   def my_tool(param: str) -> Dict[str, Any]:
       """A tool that does something useful."""
       # Tool implementation
       return {"result": "value"}
   ```
3. **Integrate with the API** in `src/app/main.py`:

   ```python
   from src.agents.my_agent import my_agent


   @app.get("/my-endpoint")
   async def my_endpoint(question: str, log: structlog.BoundLogger = Depends(get_logger)):
       return my_agent(question, log)
   ```
### Adding a New Tool

1. **Create the tool file** in `src/agents/tools/`:

   ```python
   # src/agents/tools/my_tool.py
   from typing import Any, Dict


   def my_tool(param1: str, param2: int) -> Dict[str, Any]:
       """Tool description.

       Args:
           param1: Description of param1
           param2: Description of param2

       Returns:
           Dictionary with results
       """
       # Tool implementation
       return {"result": "value"}
   ```
2. **Use it in an agent**:

   ```python
   from src.agents.tools.my_tool import my_tool


   def my_agent(question: str) -> Dict[str, Any]:
       result = my_tool(param1="value", param2=42)
       return result
   ```
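Because tools are plain functions with typed signatures, they are easy to unit-test in isolation. A sketch that assumes the `my_tool` signature above, with an inlined hypothetical stand-in for the real implementation so the example is self-contained:

```python
# Stand-in for src/agents/tools/my_tool.py so the example is self-contained.
from typing import Any, Dict


def my_tool(param1: str, param2: int) -> Dict[str, Any]:
    """Hypothetical tool: combines its parameters into a result payload."""
    return {"result": f"{param1}:{param2}"}


def test_my_tool_returns_expected_payload() -> None:
    result = my_tool(param1="value", param2=42)
    assert isinstance(result, dict)
    assert result["result"] == "value:42"
```

Placed in a `test_`-prefixed module, a test like this is collected automatically by pytest.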
## 4. Development Workflow

### Local Setup

1. **Clone the repository**:

   ```bash
   git clone git@github.com:datagems-eosc/in-data-exploration.git
   cd in-data-exploration
   ```

2. **Install dependencies**:

   ```bash
   # Using uv (recommended)
   uv sync

   # Or using pip
   pip install -e .
   ```

3. **Set up environment variables**:

   ```bash
   cp .env.template .env
   # Edit .env with your configuration
   ```

4. **Run the service**:

   ```bash
   # Using uvicorn directly
   uvicorn src.app.main:app --reload --port 8080

   # Or using Docker
   docker build -t in-data-exploration .
   docker run --env-file .env -p 8080:8080 in-data-exploration
   ```
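Step 3 copies `.env.template` to `.env`. The keys below are purely hypothetical placeholders showing the file's shape; consult the template itself for the real variable names:

```
# Illustrative keys only (see .env.template for the authoritative list)
LOG_LEVEL=INFO
PORT=8080
```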
### Testing

1. **Run the health check**:

   ```bash
   curl http://localhost:8080/health
   ```

2. **Test an endpoint**:

   ```bash
   curl "http://localhost:8080/geospatial?question=What are the coordinates in Berlin?"
   ```

3. **View the API documentation**:

   - Swagger UI: http://localhost:8080/swagger
   - ReDoc: http://localhost:8080/redoc
### Code Style

- Follow PEP 8 style guidelines
- Use type hints for all function parameters and return values
- Use Pydantic models for data validation
- Add docstrings to all public functions and classes
- Use structured logging with `structlog`
- Handle errors gracefully and return appropriate HTTP status codes
## Best Practices

- **Error Handling**: Always wrap agent calls in try-except blocks and log errors appropriately.
- **Logging**: Use structured logging with context (user ID, correlation ID, etc.).
- **Validation**: Use Pydantic models for all request/response validation.
- **Security**: Follow the security patterns established in `src/app/security.py` for protected endpoints.
- **Testing**: Write tests for new features and ensure they pass before submitting PRs.
## Next Steps

- Review the Architecture documentation for more details on system design
- Check the API Overview to understand existing endpoints
- Read the Configuration guide for environment setup
- Explore the source code in `src/agents/` to see examples of agent implementations