
Developer Onboarding Guide

Welcome to the developer guide for the In-Dataset Discovery Service.

This document provides a comprehensive overview of the service's architecture and a step-by-step guide for extending it with new features, agents, and tools.

Table of Contents

  1. Service Overview
     • Core Principles
     • Directory Structure
     • Key Components
  2. System Architecture
     • API Layer
     • Agent Layer
     • Tool Layer
  3. Adding New Features
     • Adding a New API Endpoint
     • Adding a New Agent
     • Adding a New Tool
  4. Development Workflow
     • Local Setup
     • Testing
     • Code Style

1. Service Overview

The In-Dataset Discovery Service is a FastAPI-based application that provides natural language interfaces for exploring datasets. It uses LLM capabilities to convert user questions into executable queries and agents to orchestrate complex multi-step tasks.

Core Principles

The service is built on a few key principles:

  • Modularity: Each component (agents, tools, models) is self-contained and can be extended independently.
  • LLM-Powered: Leverages LLMs and LangChain for intelligent query generation and processing.
  • Agent-Based: Uses LangGraph to orchestrate complex workflows that combine multiple tools.
  • Type Safety: Uses Pydantic models for request/response validation and type safety.
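
The type-safety principle in practice: a minimal sketch of Pydantic validation, using an illustrative model (not one of the service's real models):

```python
from pydantic import BaseModel, Field, ValidationError

class QueryRequest(BaseModel):
    """Illustrative request model showing validated, typed input."""
    question: str = Field(..., description="The question to process")

# Well-formed input parses into a typed object.
req = QueryRequest(question="Where is Berlin?")

# Input missing the required field is rejected before any agent code runs.
try:
    QueryRequest()
    rejected = False
except ValidationError:
    rejected = True
```

Because validation happens at the model boundary, endpoint and agent code can assume well-typed data.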

Directory Structure

src/
├── app/                    # FastAPI application
│   ├── main.py            # API endpoints and routing
│   ├── config.py          # Configuration management
│   ├── security.py        # Authentication and authorization
│   ├── exceptions.py      # Custom exception classes
│   └── logging_config.py  # Logging setup
├── agents/                 # Agent implementations
│   ├── geospatial.py      # Geospatial query agent
│   ├── text2sql.py        # Text-to-SQL agent
│   ├── graph.py           # LangGraph workflow definitions
│   └── tools/             # Agent tools
│       ├── geospatial_tools.py
│       └── text2sql_tool.py
└── models/                 # Pydantic models
    └── geospatial_models.py

Key Components

  1. FastAPI Application (src/app/main.py): Defines all API endpoints and request/response models.
  2. Agents (src/agents/): Implement the core logic for processing queries:
     • geospatial.py: Handles geospatial queries using Wikidata and the Overpass API
     • text2sql.py: Converts natural language to SQL queries
  3. Tools (src/agents/tools/): Reusable functions that agents can call:
     • geospatial_tools.py: Tools for geospatial operations
     • text2sql_tool.py: Tools for SQL generation and execution
  4. Models (src/models/): Pydantic models for data validation

2. System Architecture

API Layer

The API layer (src/app/main.py) handles HTTP requests and responses. Each endpoint:

  • Validates input using Pydantic models
  • Handles authentication (if required)
  • Calls the appropriate agent or tool
  • Returns structured responses

Agent Layer

Agents are the core processing units that handle specific types of queries:

  • Geospatial Agent: Processes location-based queries
  • Text-to-SQL Agent: Converts questions to SQL queries

Tool Layer

Tools are reusable functions that agents can invoke. They encapsulate specific operations such as:

  • Querying external APIs (Overpass, Wikidata)
  • Generating SQL queries
  • Executing database queries
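
To illustrate how an agent composes tools, here is a sketch in which both tool functions are stand-ins (they return canned data rather than calling the real Wikidata or Overpass clients):

```python
from typing import Any, Dict

def lookup_entity(name: str) -> Dict[str, Any]:
    # Stand-in for a Wikidata lookup tool.
    return {"name": name, "qid": "Q64"}

def fetch_features(qid: str) -> Dict[str, Any]:
    # Stand-in for an Overpass query tool.
    return {"qid": qid, "features": ["node", "way"]}

def geospatial_agent(question: str) -> Dict[str, Any]:
    # An agent chains tools: resolve the entity, then query features for it.
    entity = lookup_entity(question)
    return fetch_features(entity["qid"])

result = geospatial_agent("Berlin")
```

The agent owns the orchestration; each tool stays a small, independently testable function.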


3. Adding New Features

Adding a New API Endpoint

  1. Define Request/Response Models in src/app/main.py:
from typing import Dict, Any

from pydantic import BaseModel, Field

class MyQueryRequest(BaseModel):
    question: str = Field(..., description="The question to process")

class MyQueryResponse(BaseModel):
    answer: str
    metadata: Dict[str, Any]
  2. Create the Endpoint:
@app.post("/my-endpoint", response_model=MyQueryResponse)
async def my_endpoint(query: MyQueryRequest):
    """Process a query and return results."""
    # Your logic here
    result = process_query(query.question)
    return MyQueryResponse(answer=result, metadata={})
  3. Add Error Handling:
try:
    result = process_query(query.question)
    return MyQueryResponse(answer=result, metadata={})
except Exception as e:
    logger.error("query_failed", error=str(e))
    raise HTTPException(status_code=500, detail=str(e))

Adding a New Agent

  1. Create Agent File in src/agents/:
# src/agents/my_agent.py
import structlog
from typing import Dict, Any

logger = structlog.get_logger(__name__)

def my_agent(question: str, log: structlog.BoundLogger) -> Dict[str, Any]:
    """Process a question using the my_agent."""
    log.info("my_agent_started", question=question)

    # Your agent logic here
    result = process_question(question)

    log.info("my_agent_completed", result=result)
    return result
  2. Add Tools (if needed) in src/agents/tools/:
# src/agents/tools/my_tool.py
from typing import Dict, Any

def my_tool(param: str) -> Dict[str, Any]:
    """A tool that does something useful."""
    # Tool implementation
    return {"result": "value"}
  3. Integrate with API in src/app/main.py:
import structlog
from fastapi import Depends

from src.agents.my_agent import my_agent

@app.get("/my-endpoint")
async def my_endpoint(question: str, log: structlog.BoundLogger = Depends(get_logger)):
    return my_agent(question, log)
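
A quick way to exercise a new agent in isolation is to call it with a stub logger; both `StubLogger` and `process_question` here are illustrative stand-ins so the sketch runs without structlog or the real agent logic installed:

```python
from typing import Any, Dict

class StubLogger:
    """Minimal stand-in for structlog.BoundLogger so the sketch runs anywhere."""
    def info(self, event: str, **kwargs: Any) -> None:
        print(event, kwargs)

def process_question(question: str) -> Dict[str, Any]:
    # Illustrative stand-in for the agent's real logic.
    return {"answer": question.upper()}

def my_agent(question: str, log: StubLogger) -> Dict[str, Any]:
    log.info("my_agent_started", question=question)
    result = process_question(question)
    log.info("my_agent_completed", result=result)
    return result

out = my_agent("hello", StubLogger())
```

In the real service the bound structlog logger is injected via the API layer, but the agent only needs something with the same `info(event, **kwargs)` shape.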

Adding a New Tool

  1. Create Tool File in src/agents/tools/:
# src/agents/tools/my_tool.py
from typing import Dict, Any

def my_tool(param1: str, param2: int) -> Dict[str, Any]:
    """
    Tool description.

    Args:
        param1: Description of param1
        param2: Description of param2

    Returns:
        Dictionary with results
    """
    # Tool implementation
    return {"result": "value"}
  2. Use in Agent:
from typing import Dict, Any

from src.agents.tools.my_tool import my_tool

def my_agent(question: str) -> Dict[str, Any]:
    result = my_tool(param1="value", param2=42)
    return result
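
Because tools are plain functions, they are easy to unit-test directly. A sketch, with an illustrative implementation filled in for `my_tool`:

```python
from typing import Dict, Any

def my_tool(param1: str, param2: int) -> Dict[str, Any]:
    """Illustrative tool: combines its inputs into a result dict."""
    return {"result": f"{param1}-{param2}"}

def test_my_tool():
    out = my_tool(param1="value", param2=42)
    assert isinstance(out, dict)
    assert out["result"] == "value-42"

test_my_tool()
```

Keeping tools free of framework dependencies is what makes this kind of direct testing possible.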

4. Development Workflow

Local Setup

  1. Clone the repository:

    git clone git@github.com:datagems-eosc/in-data-exploration.git
    cd in-data-exploration
    

  2. Install dependencies:

    # Using uv (recommended)
    uv sync
    
    # Or using pip
    pip install -e .
    

  3. Set up environment variables:

    cp .env.template .env
    # Edit .env with your configuration
    

  4. Run the service:

    # Using uvicorn directly
    uvicorn src.app.main:app --reload --port 8080
    
    # Or using Docker
    docker build -t in-data-exploration .
    docker run --env-file .env -p 8080:8080 in-data-exploration
    

Testing

  1. Run health check:

    curl http://localhost:8080/health
    

  2. Test an endpoint:

    curl -G "http://localhost:8080/geospatial" --data-urlencode "question=What are the coordinates in Berlin?"
    

  3. View API documentation:

     • Swagger UI: http://localhost:8080/swagger
     • ReDoc: http://localhost:8080/redoc

Code Style

  • Follow PEP 8 style guidelines
  • Use type hints for all function parameters and return values
  • Use Pydantic models for data validation
  • Add docstrings to all public functions and classes
  • Use structured logging with structlog
  • Handle errors gracefully and return appropriate HTTP status codes
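
A function following these conventions might look like the sketch below (the name and behavior are illustrative, not part of the service):

```python
from typing import Any, Dict, List

def summarize_rows(rows: List[Any], limit: int = 5) -> Dict[str, Any]:
    """Summarize query rows for a response payload.

    Args:
        rows: Rows returned from a query.
        limit: Maximum number of rows to include in the preview.

    Returns:
        Dictionary with the total count and a preview slice.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return {"count": len(rows), "preview": rows[:limit]}
```

Type hints, a docstring, and explicit input checking keep the function self-documenting and easy to review.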

Best Practices

  1. Error Handling: Always wrap agent calls in try-except blocks and log errors appropriately.
  2. Logging: Use structured logging with context (user ID, correlation ID, etc.).
  3. Validation: Use Pydantic models for all request/response validation.
  4. Security: Follow the security patterns established in src/app/security.py for protected endpoints.
  5. Testing: Write tests for new features and ensure they pass before submitting PRs.

Next Steps

  • Review the Architecture documentation for more details on system design
  • Check the API Overview to understand existing endpoints
  • Read the Configuration guide for environment setup
  • Explore the source code in src/agents/ to see examples of agent implementations