MathE Material Recommender

This document describes the recommendation approach currently deployed for MathE materials.

The service receives:

the MathE question ID
the question text, usually in LaTeX format
the number of recommendations to return

It returns MathE material IDs, ranked from most to least relevant.

Current Scope

The recommender currently works on PDF materials only.

Materials stored as DOCX, PPTX, videos, and links are not part of the current recommender index.

The current production system also enforces a hard rule, requested by the MathE team: a recommended material must have exactly the same topic and subtopic as the question.

Signals Used

For each question, the recommender uses:

topic
subtopic
keywords
the question itself

For each material, the recommender uses:

topic
subtopic
keywords
OCR-extracted text

Each material receives two scores:

keyword_jaccard
question_to_material_similarity

keyword_jaccard measures keyword overlap between the question and the material.

question_to_material_similarity measures similarity between the embedded question text and the stored PDF OCR embedding.

Production Flow

flowchart TB
    A["API request: question_id, question text, k"] --> B["Read question metadata"]
    B --> C["Fetch same-topic/same-subtopic PDF pool"]
    C --> D{"Any eligible PDFs?"}
    D -- "No" --> E["Return no recommendations"]
    D -- "Yes" --> F["Compute keyword overlap for each PDF"]
    F --> G["Embed question text"]
    G --> H["Score pool with question-to-material similarity"]
    H --> I["Compute final score"]
    I --> J["Rank and keep top-k"]
    J --> K["Resolve Redis PDF IDs to MathE material IDs"]
    K --> L["Return material IDs"]
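
The control flow in the flowchart can be sketched as a single function with its dependencies passed in. This is a minimal sketch, not the production code: the function name `recommend`, the four injected callables, and the toy data are all hypothetical. It only illustrates the two early returns (unknown question, empty pool) and the rank-then-resolve order.

```python
from typing import Callable, Optional, Sequence


def recommend(
    question_id: int,
    question_text: str,
    k: int,
    *,
    load_metadata: Callable[[int], Optional[dict]],
    fetch_pool: Callable[[dict], Sequence[str]],
    score_pool: Callable[[dict, str, Sequence[str]], dict],
    resolve_ids: Callable[[Sequence[str]], list],
) -> list:
    """Mirror the flowchart branches; every dependency is a hypothetical stub."""
    metadata = load_metadata(question_id)
    if metadata is None:
        return []  # question not found in the mirror database
    pool = fetch_pool(metadata)
    if not pool:
        return []  # "Any eligible PDFs?" -> No
    scores = score_pool(metadata, question_text, pool)  # PDF ID -> final score
    ranked = sorted(pool, key=lambda pdf_id: scores[pdf_id], reverse=True)
    return resolve_ids(ranked[:k])


# Toy run with invented data standing in for the database and the scorer.
result = recommend(
    question_id=7,
    question_text=r"Compute $\det A$.",
    k=2,
    load_metadata=lambda qid: {"topic": "T", "subtopic": "S", "keywords": ["determinant"]},
    fetch_pool=lambda meta: ["30.pdf", "31.pdf", "32.pdf"],
    score_pool=lambda meta, text, pool: {"30.pdf": 0.2, "31.pdf": 0.9, "32.pdf": 0.5},
    resolve_ids=lambda pdf_ids: [f"db:{pdf_id}" for pdf_id in pdf_ids],
)
print(result)  # ['db:31.pdf', 'db:32.pdf']
```

Passing the dependencies as arguments keeps the sketch runnable without a database or an embedding model; the real recommender wires these steps together internally.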

Request-Time Logic, Step By Step

Step 1 - API Receives The Request

The production API endpoint is in:

dataset_recsys/api/routes/mathe.py

The route calls:

recommend_from_curricular_pool(...)

from:

dataset_recsys/mathe_recommenders/curricular_pool_ranker.py

This is the production recommender path.

Step 2 - Load Question Metadata

The recommender first loads the MathE metadata for the question:

topic
subtopic
keywords

This is done through:

MatheMirrorClient.get_question_metadata(...)

in:

dataset_recsys/storage/mathe_mirror_client.py

If the question does not exist in the mirror database, the recommender returns no recommendations.

Step 3 - Build The Eligible Material Pool

The production recommender then fetches all PDF materials assigned to the same topic and subtopic as the question.

This is done through:

MatheMirrorClient.get_pdf_materials_for_question_topic_subtopic(...)

in:

dataset_recsys/storage/mathe_mirror_client.py
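
Conceptually, the pool is an exact-match filter on file type, topic, and subtopic. The sketch below applies that filter to an in-memory list. The `Material` record, its field names, and the sample topics are hypothetical; the production code reads this from the MathE mirror database through the client method named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Material:
    # Hypothetical record; the real schema lives in the mirror database.
    material_id: str
    file_type: str
    topic: str
    subtopic: str


def eligible_pool(materials, topic, subtopic):
    """Keep only PDFs whose topic and subtopic both match the question exactly."""
    return [
        m
        for m in materials
        if m.file_type == "pdf" and m.topic == topic and m.subtopic == subtopic
    ]


# Invented sample data for illustration only.
materials = [
    Material("30.pdf", "pdf", "Linear Algebra", "Matrices"),
    Material("31.pdf", "pdf", "Linear Algebra", "Vector Spaces"),
    Material("32.docx", "docx", "Linear Algebra", "Matrices"),
]
pool = eligible_pool(materials, "Linear Algebra", "Matrices")
print([m.material_id for m in pool])  # ['30.pdf']
```

The second material is dropped because its subtopic differs, and the third because it is not a PDF.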

Step 4 - Compute Keyword Overlap

For every material in the eligible pool, the recommender computes:

keyword_jaccard =
  |question_keywords intersection material_keywords|
  /
  |question_keywords union material_keywords|

If both keyword sets are empty, the score is 0.0.

The helper is:

compute_keyword_jaccard(...)

in:

dataset_recsys/mathe_recommenders/seed_scoring.py

Inside this production ranker, metadata_score is set equal to keyword_jaccard, because topic and subtopic have already been used to define the pool.
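
The keyword overlap defined in this step can be sketched with Python sets. This is an illustrative version, not the body of `compute_keyword_jaccard`: it assumes keywords are compared as exact strings, and any normalization the real helper applies (case folding, trimming) is not modeled here.

```python
def keyword_jaccard(question_keywords, material_keywords):
    """Jaccard overlap of two keyword collections; 0.0 when both are empty."""
    q, m = set(question_keywords), set(material_keywords)
    union = q | m
    if not union:
        return 0.0  # both sets empty: avoid dividing by zero
    return len(q & m) / len(union)


# Two shared keywords out of three distinct ones.
print(keyword_jaccard(["matrix", "determinant"], ["determinant", "inverse", "matrix"]))
print(keyword_jaccard([], []))  # 0.0
```

The first call shares 2 keywords out of 3 distinct ones, so the score is 2/3.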

Step 5 - Embed The Question Text

The question text sent by MathE is embedded at request time.

This is done by:

encode_question(...)

in:

dataset_recsys/mathe_recommenders/question_embedding.py

The same embedding model is used for the MathE material OCR embeddings, so question and material vectors are directly comparable.

Step 6 - Score Eligible Materials Against The Question

The recommender scores only the materials already present in the eligible pool.

This is done with:

score_question_similarity_for_material_ids(...)

in:

dataset_recsys/mathe_recommenders/question_embedding.py

Internally, this calls:

EmbeddingClient.find_similar_by_ids(...)

from:

dataset_recsys/storage/embedding_client.py

This means the vector query asks:

For these specific eligible material IDs, how similar is each one to the question?

It does not perform an open nearest-neighbor search over all materials.

If an eligible material has no stored embedding, it keeps:

question_to_material_similarity = 0.0
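
The difference between scoring a fixed set of IDs and running an open nearest-neighbor search can be illustrated with a small in-memory sketch. It assumes cosine similarity as the metric, which this document does not state, and it uses a plain dictionary in place of the vector store. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_pool(question_vector, pool_ids, stored_embeddings):
    """Score only the IDs in the pool; a missing embedding keeps 0.0."""
    scores = {}
    for material_id in pool_ids:
        vector = stored_embeddings.get(material_id)
        scores[material_id] = (
            cosine_similarity(question_vector, vector) if vector is not None else 0.0
        )
    return scores


# "99.pdf" is stored but outside the pool, so it is never scored.
# "32.pdf" is in the pool but has no stored embedding.
stored = {"30.pdf": [1.0, 0.0], "31.pdf": [0.0, 1.0], "99.pdf": [1.0, 0.0]}
scores = score_pool([1.0, 0.0], ["30.pdf", "31.pdf", "32.pdf"], stored)
print(scores)  # {'30.pdf': 1.0, '31.pdf': 0.0, '32.pdf': 0.0}
```

Even though `99.pdf` would be a perfect match for the question vector, it never appears in the result because it is not in the eligible pool.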

Step 7 - Compute Final Score

The final score is computed in:

_rank_candidates(...)

inside:

dataset_recsys/mathe_recommenders/curricular_pool_ranker.py

The score is:

final_score =
    lambda * keyword_jaccard
  + (1 - lambda) * question_to_material_similarity

The default is:

lambda = 0.6

So by default:

final_score =
    0.6 * keyword_jaccard
  + 0.4 * question_to_material_similarity

The reason for this weighting is that, after the hard topic/subtopic filter, keyword overlap is the remaining explicit curriculum signal. The question embedding then refines the order inside the same curricular pool.
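
A short worked example shows how the two signals trade off under the default weight. The helper names and the candidate scores below are invented for illustration, and tie-breaking between equal scores is not specified in this document, so the sketch does not model it.

```python
def final_score(keyword_jaccard, similarity, keyword_weight=0.6):
    """Weighted blend of keyword overlap and question-to-material similarity."""
    return keyword_weight * keyword_jaccard + (1.0 - keyword_weight) * similarity


def rank_top_k(candidates, k, keyword_weight=0.6):
    """candidates maps ID -> (keyword_jaccard, question_to_material_similarity)."""
    scored = {
        material_id: final_score(kj, sim, keyword_weight)
        for material_id, (kj, sim) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


candidates = {
    "30.pdf": (0.50, 0.80),  # 0.6 * 0.50 + 0.4 * 0.80 = 0.62
    "31.pdf": (0.25, 0.90),  # 0.6 * 0.25 + 0.4 * 0.90 = 0.51
    "32.pdf": (0.75, 0.10),  # 0.6 * 0.75 + 0.4 * 0.10 = 0.49
}
print(rank_top_k(candidates, k=2))  # ['30.pdf', '31.pdf']
```

Note that `32.pdf` has the highest keyword overlap but still ranks last: keyword overlap carries more weight than similarity, yet it cannot compensate for a very low similarity score.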

Step 8 - Return MathE Material IDs

Candidates are keyed internally by material_redis_id, for example:

30.pdf

The API response returns the corresponding MathE database material ID instead.

This resolution is done by:

resolve_db_material_ids(...)

in:

dataset_recsys/mathe_recommenders/metadata_ocr.py
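
The resolution step amounts to an order-preserving lookup from internal keys to database IDs. The sketch below is hypothetical throughout: the function name, the lookup table, and the database IDs are invented placeholders, and skipping keys that cannot be resolved is an assumption rather than documented behavior of `resolve_db_material_ids`.

```python
def resolve_material_ids(ranked_redis_ids, redis_to_db_id):
    """Map ranked internal keys to database IDs, keeping the ranking order."""
    resolved = []
    for redis_id in ranked_redis_ids:
        db_id = redis_to_db_id.get(redis_id)
        if db_id is not None:  # assumption: unresolved keys are skipped
            resolved.append(db_id)
    return resolved


# Invented placeholder IDs; the real mapping comes from MathE metadata.
lookup = {"30.pdf": 9001, "31.pdf": 9002}
print(resolve_material_ids(["31.pdf", "30.pdf", "77.pdf"], lookup))  # [9002, 9001]
```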

What Changed From The Previous Hybrid Version

The previous hybrid recommender used three sources:

metadata seed PDFs
OCR-neighbor PDFs
question-nearest PDFs

That approach was useful for open discovery, but it could recommend materials outside the exact question topic/subtopic.

The older hybrid implementation is still kept for comparison and validation:

dataset_recsys/mathe_recommenders/hybrid.py

Implementation Map

| Step | File | Function | Role |
| --- | --- | --- | --- |
| API entry point | `dataset_recsys/api/routes/mathe.py` | route handler | Receives the MathE request and calls the production recommender. |
| Production recommender | `dataset_recsys/mathe_recommenders/curricular_pool_ranker.py` | `recommend_from_curricular_pool` | Returns top-k material IDs from the same topic/subtopic PDF pool. |
| Candidate ranking | `dataset_recsys/mathe_recommenders/curricular_pool_ranker.py` | `rank_curricular_pool_candidates` | Builds the eligible pool, scores it, and ranks candidates. |
| Final scoring | `dataset_recsys/mathe_recommenders/curricular_pool_ranker.py` | `_rank_candidates` | Computes `final_score` and sorts candidates. |
| Question metadata | `dataset_recsys/storage/mathe_mirror_client.py` | `get_question_metadata` | Reads question topic, subtopic, and keywords. |
| Eligible pool | `dataset_recsys/storage/mathe_mirror_client.py` | `get_pdf_materials_for_question_topic_subtopic` | Fetches only PDFs in the same topic/subtopic as the question. |
| Keyword score | `dataset_recsys/mathe_recommenders/seed_scoring.py` | `compute_keyword_jaccard` | Computes question/material keyword overlap. |
| Question embedding | `dataset_recsys/mathe_recommenders/question_embedding.py` | `encode_question` | Embeds the MathE question text. |
| Question similarity | `dataset_recsys/mathe_recommenders/question_embedding.py` | `score_question_similarity_for_material_ids` | Scores eligible materials against the question embedding. |
| Vector scoring by IDs | `dataset_recsys/storage/embedding_client.py` | `find_similar_by_ids` | Scores only the material IDs already in the eligible pool. |
| ID resolution | `dataset_recsys/mathe_recommenders/metadata_ocr.py` | `resolve_db_material_ids` | Converts internal Redis PDF IDs back to MathE material IDs. |
| Comparison CLI | `dataset_recsys/utils/mathe_recsys_compare_cli.py` | `main` | Runs selected recommender approaches for validation and CSV/JSON export. |

Configuration

The production curricular pool ranker is controlled by:

MATHE_CURRICULAR_KEYWORD_WEIGHT
MATHE_EMBEDDING_MODEL

Current defaults:

MATHE_CURRICULAR_KEYWORD_WEIGHT=0.6
MATHE_EMBEDDING_MODEL=BAAI/bge-m3

The question similarity weight is always:

1 - MATHE_CURRICULAR_KEYWORD_WEIGHT
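
The relationship between the two weights can be sketched as a small loader. This assumes the two settings are read as environment variables, which this document does not state explicitly. The function name, the returned dictionary keys, and the range check are hypothetical.

```python
import os


def load_ranker_config(env=None):
    """Read the two settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    keyword_weight = float(env.get("MATHE_CURRICULAR_KEYWORD_WEIGHT", "0.6"))
    # Assumption: the weight must stay in [0, 1] so both terms are non-negative.
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("MATHE_CURRICULAR_KEYWORD_WEIGHT must be between 0 and 1")
    return {
        "keyword_weight": keyword_weight,
        "question_weight": 1.0 - keyword_weight,  # always the complement
        "embedding_model": env.get("MATHE_EMBEDDING_MODEL", "BAAI/bge-m3"),
    }


config = load_ranker_config({"MATHE_CURRICULAR_KEYWORD_WEIGHT": "0.75"})
print(config)
# {'keyword_weight': 0.75, 'question_weight': 0.25, 'embedding_model': 'BAAI/bge-m3'}
```

Because the question weight is derived rather than configured separately, the two weights always sum to 1.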

The older recommenders are still available for comparison:

dataset_recsys/mathe_recommenders/metadata_ocr.py
dataset_recsys/mathe_recommenders/question_embedding.py
dataset_recsys/mathe_recommenders/hybrid.py

They can be compared through:

dataset_recsys/utils/mathe_recsys_compare_cli.py