PATENT-MATCH PREDICTION 2024-2025 - Task 2: Re-ranking

This directory contains the training data and scripts for the second task of the PATENT-MATCH PREDICTION challenge, which focuses on reranking patent documents based on their relevance to query patents.

Directory Contents

Data Files:

  • train_queries.json - List of query patent IDs for training
  • test_queries.json - List of query patent IDs for testing (gold mappings are not accessible during the challenge)
  • train_gold_mapping.json - Gold standard mappings of relevant documents for each training query
  • shuffled_pre_ranking.json - Initial random ranking of documents for each query
  • queries_content_with_features.json - Content of the query patents with LLM-extracted features
  • documents_content_with_features.json - Content of the candidate documents with LLM-extracted features

Both content files include an additional key called features in the Content column. These are LLM-extracted features from the claim set that summarize the main aspects of the invention. You can use these features alone or in combination with other parts of the patent to boost your re-ranking performance.
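
For a quick sanity check, here is a minimal sketch of loading a content file and reading the features text. The exact JSON layout is an assumption (a mapping from patent ID to a record whose Content entry holds a features key), so adjust the key access to whatever structure you actually find in the files.

import json

# Load the query content file (assumes it sits in the working directory)
with open("queries_content_with_features.json") as f:
    queries = json.load(f)

# Peek at one query and its LLM-extracted features
# (key names below are assumptions -- print the record to confirm the real structure)
first_id = next(iter(queries))
record = queries[first_id]
print(first_id)
print(record["Content"]["features"])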

Scripts:

  • cross_encoder_reranking_train.py - Script for reranking documents using cross-encoder models (training data only)
  • evaluate_train_rankings.py - Evaluation script to measure ranking performance (training data only)
  • metrics.py - Implementation of evaluation metrics (Recall@k, MAP, etc.)

Task Overview

This re-ranking task is more approachable than task 1 because:

  • We already provide you with a pre-ranking of 30 candidate patents for each query
  • Your document corpus for each query is limited to just 30 documents (compared to thousands in task 1 or millions in real-life scenarios)

The challenge uses 20 queries for training (with gold mappings provided) and 10 queries for testing (gold mappings not provided during the challenge).

Objectives

  1. Apply dense retrieval methods learned in task 1 to achieve decent baseline results
  2. Develop custom embedding, scoring, and evaluation scripts
  3. Experiment with different text representations (including the provided features)
  4. Implement and compare various ranking methodologies

All the necessary processing can be done on Google Colab using the free GPUs provided, making this task accessible to everyone.
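
As a concrete starting point for the first two objectives, below is a minimal dense-retrieval sketch that re-ranks the candidates of one query by cosine similarity of sentence embeddings. It uses the sentence-transformers library with placeholder texts and hypothetical patent IDs; wiring in the real query/candidate content and looping over all queries from shuffled_pre_ranking.json is up to you.

from sentence_transformers import SentenceTransformer, util

# Bi-encoder embedding model (any model from the MTEB leaderboard can be swapped in)
model = SentenceTransformer("intfloat/e5-large-v2")

# Placeholder texts and IDs -- replace with real content from the JSON files.
# E5 models expect "query: " / "passage: " prefixes (see the model card).
query_text = "query: A method for wireless charging of electric vehicles"
candidate_ids = ["EP0000001", "EP0000002"]  # hypothetical IDs
candidate_texts = [
    "passage: Inductive power transfer system for vehicles",
    "passage: A toothbrush with replaceable bristle heads",
]

# Embed, score with cosine similarity, and sort candidates by score (descending)
query_emb = model.encode(query_text, normalize_embeddings=True)
cand_embs = model.encode(candidate_texts, normalize_embeddings=True)
scores = util.cos_sim(query_emb, cand_embs)[0].tolist()
reranked = [cid for _, cid in sorted(zip(scores, candidate_ids), reverse=True)]
print(reranked)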

Embedding Models

You can find various embedding models on the MTEB Leaderboard. The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across 56 datasets spanning 8 embedding tasks. This benchmark provides valuable insights into model performance for various retrieval and semantic tasks.

On Huggingface, you'll find:

  • Pre-trained embedding models ready for use
  • Model cards with performance metrics and usage instructions
  • Example code snippets to help you implement these models
  • Community discussions about model performance and applications

Some top-performing models to consider include E5 variants, BGE models, and various Sentence-Transformers. You can easily load these models using the Huggingface Transformers library.
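
If you prefer working with the Transformers library directly rather than through sentence-transformers, the sketch below loads an E5 model and mean-pools the token embeddings into sentence vectors. The "query:"/"passage:" prefixes and mean pooling follow the E5 model card; other models may require different pooling or prefixes, so always check the card.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")
model = AutoModel.from_pretrained("intfloat/e5-large-v2")

texts = [
    "query: battery cooling system",
    "passage: A liquid cooling circuit for battery packs",
]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Mean pooling over non-padding tokens, then L2-normalization
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
emb = torch.nn.functional.normalize(emb, dim=-1)
print(emb @ emb.T)  # cosine similarity matrix between the two texts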

Running Cross-Encoder Reranking

The cross-encoder reranking script uses pretrained language models to rerank patent documents based on their semantic similarity to query patents. You can use this script directly or adapt it into a notebook for running on Google Colab.

Basic Usage:

python cross_encoder_reranking_train.py --model_name MODEL_NAME --text_type TEXT_TYPE

Parameters:

  • --model_name: Name of the transformer model to use (default: intfloat/e5-large-v2)
    • Other options: infly/inf-retriever-v1-1.5b, Linq-AI-Research/Linq-Embed-Mistral
  • --text_type: Type of text to extract from patents (default: TA)
    • TA: Title and abstract only
    • tac1: Title, abstract, and first claim
    • claims: All claims
    • description: Patent description
    • full: All patent content
    • features: LLM-extracted features from the claims (not handled by the script out of the box; see the sketch after this list for one way to add it to extract_text)
  • --batch_size: Batch size for processing (default: 4)
  • --max_length: Maximum sequence length (default: 512)
  • --output: Output file name (default: predictions2.json)
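
The features option is not wired up in the released script. Below is a hypothetical sketch of how the corresponding branch of extract_text could look; the function name comes from the script, but its exact signature and the structure of the content dictionary are assumptions, so adapt the snippet to the actual code.

def extract_text(content, text_type):
    """Return the text to embed for a patent, depending on --text_type.

    Only the hypothetical 'features' branch is sketched here; the branches the
    script already ships (TA, tac1, claims, description, full) stay unchanged.
    """
    if text_type == "features":
        features = content.get("features", "")
        # Features may be stored as a list of strings or as a single string
        if isinstance(features, list):
            return " ".join(features)
        return str(features)
    # ... existing branches handle the other text types ...
    raise ValueError(f"Unsupported text_type: {text_type}")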

Examples:

# Basic usage with default parameters
python cross_encoder_reranking_train.py

# Use infly/inf-retriever-v1-1.5b model with title and abstract
python cross_encoder_reranking_train.py --model_name infly/inf-retriever-v1-1.5b --text_type TA

# Use E5 model with title, abstract, and first claim
python cross_encoder_reranking_train.py --model_name intfloat/e5-large-v2 --text_type tac1 --max_length 512

# Use extracted features only
python cross_encoder_reranking_train.py --text_type features

# Use custom batch size and output file
python cross_encoder_reranking_train.py --batch_size 2 --output prediction2.json
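
If you run everything in a Google Colab notebook instead of a terminal, the same commands work from a cell once the scripts and data files are in the session; prefix them with "!" (the pip packages listed here are indicative and depend on the model you choose):

# In a Colab cell
!pip install -q torch transformers sentence-transformers
!python cross_encoder_reranking_train.py --model_name intfloat/e5-large-v2 --text_type tac1
!python evaluate_train_rankings.py --re_ranking predictions2.json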

Evaluating Results

After running the reranking script, you can evaluate the performance using the evaluation script:

python evaluate_train_rankings.py --pre_ranking shuffled_pre_ranking.json --re_ranking PREDICTIONS_FILE

Parameters:

  • --pre_ranking: Path to pre-ranking file (default: shuffled_pre_ranking.json)
  • --re_ranking: Path to re-ranked file from cross-encoder output (default: predictions2.json)
  • --gold: Path to gold standard mappings (default: train_gold_mapping.json)
  • --k_values: Comma-separated list of k values for Recall@k (default: 3,5,10,20)

Example:

python evaluate_train_rankings.py --re_ranking prediction2.json

Output Metrics

The evaluation script outputs the following metrics:

  • Recall@k: Percentage of relevant documents found in the top-k results
  • MAP (Mean Average Precision): Average precision over all relevant documents, averaged across queries
  • Mean Inverse Rank: Average of 1/rank of the first relevant document across queries (higher is better)
  • Mean Rank: Average rank of the first relevant document across queries (lower is better)
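
To make the definitions concrete, here is a small self-contained sketch of these four metrics for a single query (metrics.py is the authoritative implementation; this is only an illustration). Averaging the per-query values over all queries gives the reported Recall@k, MAP, Mean Inverse Rank, and Mean Rank.

def query_metrics(ranking, gold, k_values=(3, 5, 10, 20)):
    """Per-query ranking metrics for a ranked list of document IDs and a gold set."""
    gold = set(gold)
    recall_at_k = {k: len(set(ranking[:k]) & gold) / len(gold) for k in k_values}

    # Average precision: precision at every position where a relevant document appears
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in gold:
            hits += 1
            precisions.append(hits / rank)
    average_precision = sum(precisions) / len(gold)

    # Rank of the first relevant document and its inverse
    first_rank = next((r for r, doc in enumerate(ranking, start=1) if doc in gold), None)
    inverse_rank = 1.0 / first_rank if first_rank else 0.0
    return recall_at_k, average_precision, inverse_rank, first_rank

# Example: gold documents "B" and "D" in a ranking of five candidates
print(query_metrics(["A", "B", "C", "D", "E"], ["B", "D"]))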

Challenge Task

Your task is to improve the reranking of patent documents to maximize the retrieval metrics on the test set. Use the training data to develop and evaluate your approach, then submit your best model for evaluation on the hidden test set.

Good luck!