MathE Material Recommender

This document describes the recommendation approach currently deployed for MathE materials.

The service receives:

  • the MathE question ID
  • the question text, usually in LaTeX format
  • the number of recommendations to return

It returns MathE material IDs, ranked from most to least relevant.
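As a rough illustration of this contract, the request and response could be modelled as below. The class names, field names, and integer ID types are assumptions made for this sketch and are not taken from the service code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecommendationRequest:
    """Hypothetical request shape; field names are illustrative only."""
    question_id: int
    question_text: str  # usually LaTeX
    k: int              # number of recommendations to return


@dataclass(frozen=True)
class RecommendationResponse:
    """Hypothetical response shape: material IDs, most relevant first."""
    material_ids: List[int]


request = RecommendationRequest(
    question_id=123,
    question_text=r"Compute $\int_0^1 x^2 \, dx$.",
    k=3,
)
# Made-up IDs, ordered from most to least relevant.
response = RecommendationResponse(material_ids=[30, 12, 7])
```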

Current Scope

The recommender currently works on PDF materials only.

Materials stored as DOCX, PPTX, videos, and links are not part of the current recommender index.

The current production system also enforces a hard rule: a recommended material must have exactly the same topic and subtopic as the question, as requested by the MathE team.

Signals Used

For each question, the recommender uses:

  • topic
  • subtopic
  • keywords
  • the question itself

For each material, the recommender uses:

  • topic
  • subtopic
  • keywords
  • OCR-extracted text

Each material receives two scores:

keyword_jaccard
question_to_material_similarity

keyword_jaccard measures keyword overlap between the question and the material.

question_to_material_similarity measures similarity between the embedded question text and the stored PDF OCR embedding.

Production Flow

flowchart TB
    A["API request: question_id, question text, k"] --> B["Read question metadata"]
    B --> C["Fetch same-topic/same-subtopic PDF pool"]
    C --> D{"Any eligible PDFs?"}
    D -- "No" --> E["Return no recommendations"]
    D -- "Yes" --> F["Compute keyword overlap for each PDF"]
    F --> G["Embed question text"]
    G --> H["Score pool with question-to-material similarity"]
    H --> I["Compute final score"]
    I --> J["Rank and keep top-k"]
    J --> K["Resolve Redis PDF IDs to MathE material IDs"]
    K --> L["Return material IDs"]

Request-Time Logic, Step By Step

Step 1 - API Receives The Request

The production API endpoint is in:

dataset_recsys/api/routes/mathe.py

The route calls:

recommend_from_curricular_pool(...)

from:

dataset_recsys/mathe_recommenders/curricular_pool_ranker.py

This is the production recommender path.

Step 2 - Load Question Metadata

The recommender first loads the MathE metadata for the question:

topic
subtopic
keywords

This is done through:

MatheMirrorClient.get_question_metadata(...)

in:

dataset_recsys/storage/mathe_mirror_client.py

If the question does not exist in the mirror database, the recommender returns no recommendations.
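The early exit can be sketched as follows. This is a minimal illustration, not the production code: `FAKE_MIRROR`, `lookup_question_metadata`, and `recommend_or_empty` are hypothetical names, and an in-memory dict stands in for the mirror database.

```python
from typing import Callable, Dict, List, Optional, TypedDict


class QuestionMetadata(TypedDict):
    topic: str
    subtopic: str
    keywords: List[str]


# Hypothetical in-memory stand-in for the mirror database.
FAKE_MIRROR: Dict[int, QuestionMetadata] = {
    101: {
        "topic": "Integration",
        "subtopic": "Definite integrals",
        "keywords": ["area", "riemann sum"],
    },
}


def lookup_question_metadata(question_id: int) -> Optional[QuestionMetadata]:
    """Return the metadata for a question, or None if it is not mirrored."""
    return FAKE_MIRROR.get(question_id)


def recommend_or_empty(
    question_id: int,
    rank: Callable[[QuestionMetadata], List[int]],
) -> List[int]:
    """Stop early with an empty result when the question is unknown."""
    metadata = lookup_question_metadata(question_id)
    if metadata is None:
        return []
    # The remaining ranking steps only run for known questions.
    return rank(metadata)
```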

Step 3 - Build The Eligible Material Pool

The production recommender then fetches all PDF materials assigned to the same topic and subtopic as the question.

This is done through:

MatheMirrorClient.get_pdf_materials_for_question_topic_subtopic(...)

in:

dataset_recsys/storage/mathe_mirror_client.py
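The effect of this step is an exact-match filter. The sketch below shows that idea over an in-memory list. The `Material` record and `eligible_pool` function are hypothetical. In production the selection is made by the mirror client query named above, whose implementation is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Material:
    """Hypothetical material record used only for this illustration."""
    material_id: str
    file_type: str
    topic: str
    subtopic: str


def eligible_pool(materials: List[Material], topic: str, subtopic: str) -> List[Material]:
    """Keep only PDFs whose topic AND subtopic match the question exactly."""
    return [
        m for m in materials
        if m.file_type == "pdf" and m.topic == topic and m.subtopic == subtopic
    ]


catalog = [
    Material("30.pdf", "pdf", "Integration", "Definite integrals"),
    Material("31.pdf", "pdf", "Integration", "Improper integrals"),   # wrong subtopic
    Material("32.docx", "docx", "Integration", "Definite integrals"), # not a PDF
    Material("33.pdf", "pdf", "Integration", "Definite integrals"),
]
pool = eligible_pool(catalog, "Integration", "Definite integrals")
```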

Step 4 - Compute Keyword Overlap

For every material in the eligible pool, the recommender computes:

keyword_jaccard =
  |question_keywords intersection material_keywords|
  /
  |question_keywords union material_keywords|

If both keyword sets are empty, the score is 0.0.

The helper is:

compute_keyword_jaccard(...)

in:

dataset_recsys/mathe_recommenders/seed_scoring.py

Inside this production ranker, metadata_score is set equal to keyword_jaccard, because topic and subtopic have already been used to define the pool.
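The formula above can be sketched in a few lines. This is an illustration rather than the code in seed_scoring.py: the function name `keyword_jaccard_sketch` is hypothetical, and lowercasing and trimming the keywords before comparison is an assumption.

```python
from typing import Iterable, Set


def _normalize(keywords: Iterable[str]) -> Set[str]:
    # Assumption: keywords are compared case-insensitively after trimming.
    return {k.strip().lower() for k in keywords if k.strip()}


def keyword_jaccard_sketch(
    question_keywords: Iterable[str],
    material_keywords: Iterable[str],
) -> float:
    """|intersection| / |union|, with 0.0 when both sets are empty."""
    q = _normalize(question_keywords)
    m = _normalize(material_keywords)
    union = q | m
    if not union:
        return 0.0
    return len(q & m) / len(union)


# Two shared keywords out of four distinct ones: 2 / 4.
example = keyword_jaccard_sketch(
    ["Derivative", " chain rule ", "limit"],
    ["chain rule", "limit", "integral"],
)
```

In this example the sets share "chain rule" and "limit" and contain four distinct keywords in total, so the score is 0.5.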

Step 5 - Embed The Question Text

The question text sent by MathE is embedded at request time.

This is done by:

encode_question(...)

in:

dataset_recsys/mathe_recommenders/question_embedding.py

The same embedding model is used for MathE material OCR embeddings.
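Using one model for both sides matters because vectors are only comparable when they come from the same embedding space. The sketch below illustrates this with a deliberately trivial stand-in encoder. `toy_encoder` and `encode_text` are hypothetical, the real system uses the configured embedding model, and the L2 normalization shown here is an assumption.

```python
import math
from typing import Callable, List, Sequence

Encoder = Callable[[str], Sequence[float]]


def toy_encoder(text: str) -> List[float]:
    """Hypothetical stand-in for a real model: counts of a few characters."""
    alphabet = "xy=+\\"
    return [float(text.count(ch)) for ch in alphabet]


def encode_text(text: str, encoder: Encoder) -> List[float]:
    """Encode text and L2-normalize it (normalization is an assumption)."""
    vector = list(encoder(text))
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


# The same encoder is applied to the question and to the material OCR text,
# so both vectors have the same dimension and live in the same space.
question_vec = encode_text(r"x + y = \frac{1}{2}", toy_encoder)
material_vec = encode_text("OCR text: x + y = 3", toy_encoder)
```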

Step 6 - Score Eligible Materials Against The Question

The recommender scores only the materials already present in the eligible pool.

This is done with:

score_question_similarity_for_material_ids(...)

in:

dataset_recsys/mathe_recommenders/question_embedding.py

Internally, this calls:

EmbeddingClient.find_similar_by_ids(...)

from:

dataset_recsys/storage/embedding_client.py

This means the vector query asks:

For these specific eligible material IDs, how similar is each one to the question?

It does not perform an open nearest-neighbor search over all materials.

If an eligible material has no stored embedding, it keeps the default score:

question_to_material_similarity = 0.0
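A minimal sketch of scoring restricted to the pool is shown below. It assumes cosine similarity and an in-memory dict of stored embeddings. Both are assumptions for illustration, `score_pool` is a hypothetical name, and the real lookup goes through the embedding client.

```python
import math
from typing import Dict, Iterable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pool(
    question_vec: Sequence[float],
    pool_ids: Iterable[str],
    stored_embeddings: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Score only the IDs in the pool; missing embeddings keep 0.0."""
    scores: Dict[str, float] = {}
    for material_id in pool_ids:
        vec = stored_embeddings.get(material_id)
        scores[material_id] = (
            cosine_similarity(question_vec, vec) if vec is not None else 0.0
        )
    return scores


stored = {
    "30.pdf": [1.0, 0.0],
    "31.pdf": [0.0, 1.0],
    "99.pdf": [1.0, 0.0],  # stored, but outside the pool: never scored
}
scores = score_pool([1.0, 0.0], ["30.pdf", "31.pdf", "32.pdf"], stored)
```

Here "32.pdf" is in the pool but has no stored embedding, so it keeps 0.0, while "99.pdf" is ignored even though it would match the question well.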

Step 7 - Compute Final Score

The final score is computed in:

_rank_candidates(...)

inside:

dataset_recsys/mathe_recommenders/curricular_pool_ranker.py

The score is:

final_score =
    lambda * keyword_jaccard
  + (1 - lambda) * question_to_material_similarity

The default is:

lambda = 0.6

So by default:

final_score =
    0.6 * keyword_jaccard
  + 0.4 * question_to_material_similarity

The reason for this weighting is that, after the hard topic/subtopic filter, keyword overlap is the remaining explicit curriculum signal. The question embedding then refines the order inside the same curricular pool.
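A small worked example makes the weighting concrete. The sketch below is not the production `_rank_candidates`: the function names, the candidate scores, and the tie-break by ID are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def final_score(keyword_jaccard: float, similarity: float, keyword_weight: float = 0.6) -> float:
    """Weighted blend of keyword overlap and question-to-material similarity."""
    return keyword_weight * keyword_jaccard + (1.0 - keyword_weight) * similarity


def rank_top_k(
    candidates: Dict[str, Tuple[float, float]],
    k: int,
    keyword_weight: float = 0.6,
) -> List[Tuple[str, float]]:
    """Sort by final score, highest first; ties broken by ID (an assumption)."""
    scored = [
        (material_id, final_score(kj, sim, keyword_weight))
        for material_id, (kj, sim) in candidates.items()
    ]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# material_id -> (keyword_jaccard, question_to_material_similarity), made-up values
candidates = {
    "A.pdf": (0.5, 0.9),   # 0.6 * 0.5 + 0.4 * 0.9  = 0.66
    "B.pdf": (1.0, 0.2),   # 0.6 * 1.0 + 0.4 * 0.2  = 0.68
    "C.pdf": (0.0, 0.95),  # 0.6 * 0.0 + 0.4 * 0.95 = 0.38
}
ranking = rank_top_k(candidates, k=2)
```

With the default weight, B.pdf ranks above A.pdf despite a much lower embedding similarity, because full keyword overlap carries more weight. C.pdf has the highest similarity but no keyword overlap, so it falls out of the top two.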

Step 8 - Return MathE Material IDs

Candidates are keyed internally by material_redis_id, for example:

30.pdf

The API response returns the MathE database material ID, for example:

30

This resolution is done by:

resolve_db_material_ids(...)

in:

dataset_recsys/mathe_recommenders/metadata_ocr.py
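For IDs shaped like the example above, the mapping could be as simple as taking the numeric file stem. This is only a sketch under that assumption: the actual `resolve_db_material_ids` may use a database lookup instead, and the function names below are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def redis_id_to_db_id(material_redis_id: str) -> Optional[int]:
    """Map e.g. '30.pdf' to 30; return None when the stem is not numeric."""
    stem = PurePosixPath(material_redis_id).stem
    if stem.isascii() and stem.isdigit():
        return int(stem)
    return None


def resolve_ranked_ids(ranked_redis_ids: List[str]) -> List[int]:
    """Resolve IDs in ranked order, skipping any that cannot be resolved."""
    resolved: List[int] = []
    for redis_id in ranked_redis_ids:
        db_id = redis_id_to_db_id(redis_id)
        if db_id is not None:
            resolved.append(db_id)
    return resolved


material_ids = resolve_ranked_ids(["30.pdf", "notes.pdf", "7.pdf"])
```

Ranking order is preserved, so the first resolved ID is still the most relevant material.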

What Changed From The Previous Hybrid Version

The previous hybrid recommender used three sources:

metadata seed PDFs
OCR-neighbor PDFs
question-nearest PDFs

That approach was useful for open discovery, but it could recommend materials outside the exact question topic/subtopic.

The older hybrid implementation is still kept for comparison and validation:

dataset_recsys/mathe_recommenders/hybrid.py

Implementation Map

| Step | File | Function | Role |
| --- | --- | --- | --- |
| API entry point | dataset_recsys/api/routes/mathe.py | route handler | Receives the MathE request and calls the production recommender. |
| Production recommender | dataset_recsys/mathe_recommenders/curricular_pool_ranker.py | recommend_from_curricular_pool | Returns top-k material IDs from the same topic/subtopic PDF pool. |
| Candidate ranking | dataset_recsys/mathe_recommenders/curricular_pool_ranker.py | rank_curricular_pool_candidates | Builds the eligible pool, scores it, and ranks candidates. |
| Final scoring | dataset_recsys/mathe_recommenders/curricular_pool_ranker.py | _rank_candidates | Computes final_score and sorts candidates. |
| Question metadata | dataset_recsys/storage/mathe_mirror_client.py | get_question_metadata | Reads question topic, subtopic, and keywords. |
| Eligible pool | dataset_recsys/storage/mathe_mirror_client.py | get_pdf_materials_for_question_topic_subtopic | Fetches only PDFs in the same topic/subtopic as the question. |
| Keyword score | dataset_recsys/mathe_recommenders/seed_scoring.py | compute_keyword_jaccard | Computes question/material keyword overlap. |
| Question embedding | dataset_recsys/mathe_recommenders/question_embedding.py | encode_question | Embeds the MathE question text. |
| Question similarity | dataset_recsys/mathe_recommenders/question_embedding.py | score_question_similarity_for_material_ids | Scores eligible materials against the question embedding. |
| Vector scoring by IDs | dataset_recsys/storage/embedding_client.py | find_similar_by_ids | Scores only the material IDs already in the eligible pool. |
| ID resolution | dataset_recsys/mathe_recommenders/metadata_ocr.py | resolve_db_material_ids | Converts internal Redis PDF IDs back to MathE material IDs. |
| Comparison CLI | dataset_recsys/utils/mathe_recsys_compare_cli.py | main | Runs selected recommender approaches for validation and CSV/JSON export. |

Configuration

The production curricular pool ranker is controlled by:

MATHE_CURRICULAR_KEYWORD_WEIGHT
MATHE_EMBEDDING_MODEL

Current defaults:

MATHE_CURRICULAR_KEYWORD_WEIGHT=0.6
MATHE_EMBEDDING_MODEL=BAAI/bge-m3

The question similarity weight is always:

1 - MATHE_CURRICULAR_KEYWORD_WEIGHT
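Assuming these settings are read from environment variables, loading them could look like the sketch below. The `load_ranker_config` function, the returned dict keys, and the range check on the weight are assumptions for illustration. Only the two variable names and their defaults come from this document.

```python
import os
from typing import Dict, Mapping, Optional, Union


def load_ranker_config(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Union[float, str]]:
    """Read ranker settings, falling back to the documented defaults."""
    source = os.environ if env is None else env
    keyword_weight = float(source.get("MATHE_CURRICULAR_KEYWORD_WEIGHT", "0.6"))
    # Assumption: the weight must lie in [0, 1] so both weights stay non-negative.
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("MATHE_CURRICULAR_KEYWORD_WEIGHT must be between 0 and 1")
    return {
        "keyword_weight": keyword_weight,
        # The similarity weight is derived, never configured separately.
        "similarity_weight": 1.0 - keyword_weight,
        "embedding_model": source.get("MATHE_EMBEDDING_MODEL", "BAAI/bge-m3"),
    }


defaults = load_ranker_config({})
custom = load_ranker_config({"MATHE_CURRICULAR_KEYWORD_WEIGHT": "0.8"})
```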

The older recommenders are still available for comparison:

dataset_recsys/mathe_recommenders/metadata_ocr.py
dataset_recsys/mathe_recommenders/question_embedding.py
dataset_recsys/mathe_recommenders/hybrid.py

They can be compared through:

dataset_recsys/utils/mathe_recsys_compare_cli.py