darpanaswal committed · verified
Commit 3b98ef0 · 1 Parent(s): c875dd8

Upload 12 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ datasets/documents_content_with_features.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,11 +1,128 @@
- ---
- title: Patent Retrieval
- emoji: 📊
- colorFrom: red
- colorTo: yellow
- sdk: static
- pinned: false
- short_description: Space for Information Retrieval Contest
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# PATENT-MATCH PREDICTION 2024-2025 - Task 2: Re-ranking

This directory contains the training data and scripts for the second task of the PATENT-MATCH PREDICTION challenge, which focuses on re-ranking patent documents by their relevance to query patents.

## Directory Contents

### Data Files:
- `train_queries.json` - List of query patent IDs for training
- `test_queries.json` - List of query patent IDs for testing (gold mappings are not accessible during the challenge)
- `train_gold_mapping.json` - Gold-standard mappings of relevant documents for each training query
- `shuffled_pre_ranking.json` - Initial random ranking of documents for each query
- `queries_content_with_features.json` - Content of the query patents with LLM-extracted features
- `documents_content_with_features.json` - Content of the candidate documents with LLM-extracted features

Both content files include an additional key called `features` inside each patent's `Content` entry. These are LLM-extracted features from the claim set that encapsulate the main features of the invention. You can use these features alone or in combination with other parts of the patent to boost your re-ranking performance.
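
For orientation, here is a minimal sketch of how a content file can be loaded and a patent's `features` text pulled out. It assumes the record layout used by the provided scripts (a list of entries with `FAN` and `Content` keys); adjust the path to wherever you keep the datasets.

```python
import json

# Minimal sketch: load one content file and inspect a single patent.
with open("datasets/documents_content_with_features.json") as f:
    records = json.load(f)

# Map each patent ID (FAN) to its Content dictionary, as the provided scripts do.
content_by_fan = {rec["FAN"]: rec["Content"] for rec in records}

fan, content = next(iter(content_by_fan.items()))
title = content.get("title", "")
abstract = content.get("pa01", "")  # the abstract is stored under the first paragraph key
features = " ".join(content.get("features", {}).values())  # LLM-extracted claim features

print(fan, "|", title)
print(features[:300])
```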

### Scripts:
- `cross_encoder_reranking_train.py` - Script for re-ranking documents using cross-encoder models (training data only)
- `evaluate_train_rankings.py` - Evaluation script to measure ranking performance (training data only)
- `metrics.py` - Implementation of the evaluation metrics (Recall@k, MAP, etc.)

## Task Overview

This re-ranking task is more approachable than task 1 because:
- We already provide a pre-ranking of 30 candidate patents for each query
- Your document corpus for each query is therefore limited to just 30 documents (compared to thousands in task 1 or millions in real-life scenarios)

The challenge uses 20 queries for training (with gold mappings provided) and 10 queries for testing (gold mappings not provided during the challenge).

## Objectives

1. Apply the dense retrieval methods learned in task 1 to obtain a decent baseline
2. Develop custom embedding, scoring, and evaluation scripts
3. Experiment with different text representations (including the provided features)
4. Implement and compare various ranking methodologies

All the necessary processing can be done on Google Colab using the free GPUs provided, making this task accessible to everyone.

## Embedding Models

You can find various embedding models on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across 56 datasets spanning 8 embedding tasks, and gives a good picture of how models perform on retrieval and other semantic tasks.

On Hugging Face, you'll find:
- Pre-trained embedding models ready for use
- Model cards with performance metrics and usage instructions
- Example code snippets to help you use these models
- Community discussions about model performance and applications

Some top-performing models to consider include the E5 variants, BGE models, and various Sentence-Transformers models. You can easily load them with the Hugging Face Transformers or Sentence-Transformers libraries.
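
As a concrete starting point, a simple bi-encoder baseline can be put together with the `sentence-transformers` library (which `cross_encoder_reranking_train.py` already depends on). The snippet below is only an illustration: it embeds a query and its candidate documents and sorts the candidates by cosine similarity; the texts are placeholders you would replace with output from `extract_text`.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative bi-encoder baseline: embed query and candidates, rank by cosine similarity.
model = SentenceTransformer("intfloat/e5-large-v2")  # any model from the MTEB leaderboard

# Placeholder texts; in practice these come from the query patent and its 30 pre-ranked
# candidates. Note that E5-style models expect "query: " / "passage: " prefixes;
# check the model card of whichever model you pick.
query_text = "query: <title and abstract of the query patent>"
candidate_texts = ["passage: <candidate document 1>", "passage: <candidate document 2>"]

query_emb = model.encode(query_text, convert_to_tensor=True, normalize_embeddings=True)
doc_embs = model.encode(candidate_texts, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_embs)[0]              # one score per candidate
ranked_indices = scores.argsort(descending=True).tolist()  # best candidate first
```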

## Running Cross-Encoder Reranking

The cross-encoder reranking script uses pretrained language models to re-rank patent documents based on their semantic similarity to query patents. You can use the script directly or adapt it into a notebook for running on Google Colab.

### Basic Usage:

```bash
python cross_encoder_reranking_train.py --model_name MODEL_NAME --text_type TEXT_TYPE
```

### Parameters:

- `--model_name`: Name of the transformer model to use (default: intfloat/e5-large-v2)
  - Other options: infly/inf-retriever-v1-1.5b, Linq-AI-Research/Linq-Embed-Mistral
- `--text_type`: Type of text to extract from patents (default: TA)
  - TA: Title and abstract only
  - tac1: Title, abstract, and first claim
  - claims: All claims
  - description: Patent description
  - full: All patent content
  - smart: Abstract plus the most central claims, paragraphs, and features selected by clustering
  - features: LLM-extracted features from the claims (not wired in by default; you need to implement it in `extract_text` yourself, see the sketch after this list)
- `--batch_size`: Batch size for processing (default: 4)
- `--max_length`: Maximum sequence length (default: 512)
- `--output`: Output file name (default: predictions2.json)
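
The `features` option is one possible extension rather than part of the released script. A hypothetical helper along the following lines could be called from `extract_text` under a new `"features"` branch (remember to also add `"features"` to the `--text_type` choices in `main()`):

```python
def extract_features_text(content_dict: dict) -> str:
    """Join the LLM-extracted claim features of one patent into a single string.

    Hypothetical helper, not part of the released script: call it from
    extract_text() under a new "features" text_type.
    """
    features = content_dict.get("features", {})  # mapping of feature id -> feature text
    return " ".join(features.values()).strip()
```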

### Examples:

```bash
# Basic usage with default parameters
python cross_encoder_reranking_train.py

# Use the infly/inf-retriever-v1-1.5b model with title and abstract
python cross_encoder_reranking_train.py --model_name infly/inf-retriever-v1-1.5b --text_type TA

# Use the E5 model with title, abstract, and first claim
python cross_encoder_reranking_train.py --model_name intfloat/e5-large-v2 --text_type tac1 --max_length 512

# Use the extracted features only (after adding a "features" branch to extract_text)
python cross_encoder_reranking_train.py --text_type features

# Use a custom batch size and output file
python cross_encoder_reranking_train.py --batch_size 2 --output prediction2.json
```

## Evaluating Results

After running the reranking script, you can measure its performance with the evaluation script:

```bash
python evaluate_train_rankings.py --pre_ranking shuffled_pre_ranking.json --re_ranking PREDICTIONS_FILE
```

### Parameters:

- `--pre_ranking`: Path to the pre-ranking file (default: shuffled_pre_ranking.json)
- `--re_ranking`: Path to the re-ranked file produced by the cross-encoder script (default: predictions2.json)
- `--gold`: Path to the gold standard mappings (default: train_gold_mapping.json)
- `--k_values`: Comma-separated list of k values for Recall@k (default: 3,5,10,20)

### Example:

```bash
python evaluate_train_rankings.py --re_ranking prediction2.json
```

## Output Metrics

The evaluation script reports the following metrics:
- Recall@k: Fraction of each query's relevant documents found in the top-k results, averaged over queries
- MAP (Mean Average Precision): Average precision per query, averaged over all queries
- Mean Inverse Rank: Reciprocal rank of the first relevant document, averaged over queries (higher is better)
- Mean Rank: Position of the first relevant document, averaged over queries (lower is better)
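
The official implementations live in `metrics.py`; the sketch below only illustrates, for a single query, how Recall@k and average precision are usually computed from a ranked list and a gold set (the function names here are illustrative, not the ones exported by `metrics.py`).

```python
def recall_at_k(ranked_docs: list, gold_docs: set, k: int) -> float:
    """Fraction of this query's relevant documents that appear in the top k."""
    return len(set(ranked_docs[:k]) & gold_docs) / len(gold_docs) if gold_docs else 0.0

def average_precision(ranked_docs: list, gold_docs: set) -> float:
    """Average of precision@rank over the ranks where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in gold_docs:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(gold_docs) if gold_docs else 0.0

# Toy example: the two relevant documents are ranked 1st and 4th.
ranked = ["d1", "d7", "d9", "d3", "d5"]
gold = {"d1", "d3"}
print(recall_at_k(ranked, gold, 3))     # 0.5  (one of the two relevant docs in the top 3)
print(average_precision(ranked, gold))  # 0.75 ((1/1 + 2/4) / 2)
```

Averaging these per-query values over all training queries gives the Recall@k and MAP figures reported by `evaluate_train_rankings.py`.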

## Challenge Task

Your task is to improve the reranking of patent documents to maximize the retrieval metrics on the test set. Use the training data to develop and evaluate your approach, then submit your best model for evaluation on the hidden test set.

Good luck!
cross_encoder_reranking_train.py ADDED
@@ -0,0 +1,370 @@
import os
import json
import argparse
import numpy as np
import torch
import torch.nn.functional as F
from tqdm import tqdm
from torch import Tensor
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Load embedder once (used only for the "smart" text type)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_text_list(texts):
    """Embed a list of texts with the shared MiniLM sentence embedder."""
    return embedder.encode(texts, convert_to_tensor=False)

def rank_by_centrality(texts):
    """Order texts by their mean cosine similarity to the other texts (most central first)."""
    embeddings = embed_text_list(texts)
    similarity_matrix = cosine_similarity(embeddings)
    centrality_scores = similarity_matrix.mean(axis=1)
    ranked = sorted(zip(texts, centrality_scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked]

def cluster_and_rank(texts, threshold=0.75):
    """Cluster near-duplicate texts and keep the most central text of each cluster."""
    if len(texts) <= 3:
        return texts  # Nothing to reduce

    embeddings = embed_text_list(texts)
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1 - threshold,
                                         metric="cosine", linkage='average')
    labels = clustering.fit_predict(embeddings)

    clustered_texts = {}
    for label, text in zip(labels, texts):
        clustered_texts.setdefault(label, []).append(text)

    representative_texts = []
    for cluster_texts in clustered_texts.values():
        ranked = rank_by_centrality(cluster_texts)
        representative_texts.append(ranked[0])  # Choose most central per cluster

    return representative_texts

def process_single_patent(patent_dict):
    """Select representative claims, description paragraphs, and features for one patent."""
    # Claims are stored under "c-en..." keys, description paragraphs under "p..." keys
    claims = [v for k, v in patent_dict.items() if k.startswith("c-en")]
    paragraphs = [v for k, v in patent_dict.items() if k.startswith("p")]
    features = [v for k, v in patent_dict.get("features", {}).items()]

    # Cluster & rank
    top_claims = cluster_and_rank(claims)
    top_paragraphs = cluster_and_rank(paragraphs)
    top_features = cluster_and_rank(features)

    return {
        "claims": rank_by_centrality(top_claims),
        "paragraphs": rank_by_centrality(top_paragraphs),
        "features": rank_by_centrality(top_features),
    }

def load_json_file(file_path):
    """Load JSON data from a file"""
    with open(file_path, 'r') as f:
        return json.load(f)

def save_json_file(data, file_path):
    """Save data to a JSON file"""
    with open(file_path, 'w') as f:
        json.dump(data, f, indent=2)

def load_content_data(file_path):
    """Load content data from a JSON file"""
    with open(file_path, 'r') as f:
        data = json.load(f)

    # Create a dictionary mapping FAN to Content
    content_dict = {item['FAN']: item['Content'] for item in data}
    return content_dict

def extract_text(content_dict, text_type="full"):
    """Extract text from patent content based on text_type"""
    if text_type == "TA" or text_type == "title_abstract":
        # Extract title and abstract
        title = content_dict.get("title", "")
        abstract = content_dict.get("pa01", "")
        return f"{title} {abstract}".strip()

    elif text_type == "claims":
        # Extract all claims (keys starting with 'c-')
        claims = []
        for key, value in content_dict.items():
            if key.startswith('c-'):
                claims.append(value)
        return " ".join(claims)

    elif text_type == "tac1":
        # Extract title, abstract, and first claim
        title = content_dict.get("title", "")
        abstract = content_dict.get("pa01", "")
        # Find the first claim safely
        first_claim = ""
        for key, value in content_dict.items():
            if key.startswith('c-'):
                first_claim = value
                break
        return f"{title} {abstract} {first_claim}".strip()

    elif text_type == "description":
        # Extract all paragraphs (keys starting with 'p')
        paragraphs = []
        for key, value in content_dict.items():
            if key.startswith('p'):
                paragraphs.append(value)
        return " ".join(paragraphs)

    elif text_type == "full":
        # Extract everything
        all_text = []
        # Start with title and abstract for better context at the beginning
        if "title" in content_dict:
            all_text.append(content_dict["title"])
        if "pa01" in content_dict:
            all_text.append(content_dict["pa01"])

        # Add claims and description
        for key, value in content_dict.items():
            if key != "title" and key != "pa01":
                all_text.append(value)

        return " ".join(all_text)

    elif text_type == "smart":
        filtered_dict = process_single_patent(content_dict)
        all_text = []
        # Start with the abstract for better context at the beginning
        if "pa01" in content_dict:
            all_text.append(content_dict["pa01"])

        # For claims, paragraphs and features, we take only the top-10 most central texts
        # Add claims
        for claim in filtered_dict["claims"][:10]:
            all_text.append(claim)
        # Add paragraphs
        for paragraph in filtered_dict["paragraphs"][:10]:
            all_text.append(paragraph)
        # Add features
        for feature in filtered_dict["features"][:10]:
            all_text.append(feature)

        return " ".join(all_text)

    return ""

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Extract the last token representations for pooling"""
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    """Create an instruction-formatted query"""
    return f'Instruct: {task_description}\nQuery: {query}'

def cross_encoder_reranking(query_text, doc_texts, model, tokenizer, batch_size=8, max_length=2048):
    """
    Rerank document texts based on query text using cross-encoder model

    Parameters:
        query_text (str): The query text
        doc_texts (list): List of document texts
        model: The cross-encoder model
        tokenizer: The tokenizer for the model
        batch_size (int): Batch size for processing
        max_length (int): Maximum sequence length

    Returns:
        list: Indices of documents sorted by relevance score (descending)
    """
    device = next(model.parameters()).device
    scores = []

    # Format query with instruction
    task_description = 'Re-rank a set of retrieved patents based on their relevance to a given query patent. The task aims to refine the order of patents by evaluating their semantic similarity to the query patent, ensuring that the most relevant patents appear at the top of the list.'

    instructed_query = get_detailed_instruct(task_description, query_text)

    # Process in batches to avoid OOM
    for i in tqdm(range(0, len(doc_texts), batch_size), desc="Scoring documents", leave=False):
        batch_docs = doc_texts[i:i+batch_size]

        # Encode the instructed query together with this batch of documents
        input_texts = [instructed_query] + batch_docs

        with torch.no_grad():
            # Tokenize
            batch_dict = tokenizer(input_texts, max_length=max_length, padding=True,
                                   truncation=True, return_tensors='pt').to(device)

            # Get embeddings
            outputs = model(**batch_dict)
            embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

            # Normalize embeddings
            embeddings = F.normalize(embeddings, p=2, dim=1)

            # Calculate similarity scores between the query and the documents
            batch_scores = (embeddings[0].unsqueeze(0) @ embeddings[1:].T).squeeze(0) * 100
            scores.extend(batch_scores.cpu().tolist())

    # Create list of (index, score) tuples for sorting
    indexed_scores = list(enumerate(scores))

    # Sort by score in descending order
    indexed_scores.sort(key=lambda x: x[1], reverse=True)

    # Return sorted indices
    return [idx for idx, _ in indexed_scores]

def main():
    parser = argparse.ArgumentParser(description='Re-rank patents using cross-encoder scoring (training queries only)')
    parser.add_argument('--pre_ranking', type=str, default='shuffled_pre_ranking.json',
                        help='Path to pre-ranking JSON file')
    parser.add_argument('--output', type=str, default='predictions2.json',
                        help='Path to output re-ranked JSON file')
    parser.add_argument('--queries_content', type=str,
                        default='./queries_content_with_features.json',
                        help='Path to queries content JSON file')
    parser.add_argument('--documents_content', type=str,
                        default='./documents_content_with_features.json',
                        help='Path to documents content JSON file')
    parser.add_argument('--queries_list', type=str, default='train_queries.json',
                        help='Path to training queries JSON file')
    parser.add_argument('--text_type', type=str, default='TA',
                        choices=['TA', 'claims', 'description', 'full', 'tac1', 'smart'],
                        help='Type of text to use for scoring')
    parser.add_argument('--model_name', type=str, default='intfloat/e5-large-v2',
                        help='Name of the cross-encoder model')
    parser.add_argument('--batch_size', type=int, default=4,
                        help='Batch size for scoring')
    parser.add_argument('--max_length', type=int, default=512,
                        help='Maximum sequence length')
    parser.add_argument('--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu',
                        help='Device to use (cuda/cpu)')
    parser.add_argument('--base_dir', type=str,
                        default='datasets',
                        help='Base directory for data files')

    args = parser.parse_args()

    # Ensure all paths are relative to base_dir if they're not absolute
    def get_full_path(path):
        if os.path.isabs(path):
            return path
        return os.path.join(args.base_dir, path)

    # Load training queries
    print(f"Loading training queries from {args.queries_list}...")
    queries_list = load_json_file(get_full_path(args.queries_list))
    print(f"Loaded {len(queries_list)} training queries")

    # Load pre-ranking data
    print(f"Loading pre-ranking data from {args.pre_ranking}...")
    pre_ranking = load_json_file(get_full_path(args.pre_ranking))

    # Filter pre-ranking to include only training queries
    pre_ranking = {fan: docs for fan, docs in pre_ranking.items() if fan in queries_list}
    print(f"Filtered pre-ranking to {len(pre_ranking)} training queries")

    # Load content data
    print(f"Loading query content from {args.queries_content}...")
    queries_content = load_content_data(get_full_path(args.queries_content))

    print(f"Loading document content from {args.documents_content}...")
    documents_content = load_content_data(get_full_path(args.documents_content))

    # Load model and tokenizer
    print(f"Loading model {args.model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    model = AutoModel.from_pretrained(args.model_name).to(args.device)
    model.eval()

    # Process each query and re-rank its documents
    print("Starting re-ranking process for training queries...")
    re_ranked = {}
    missing_query_fans = []
    missing_doc_fans = {}

    for query_fan, pre_ranked_docs in tqdm(pre_ranking.items(), desc="Processing queries"):
        # Check if the query FAN exists in our content data
        if query_fan not in queries_content:
            missing_query_fans.append(query_fan)
            continue

        # Extract query text
        query_text = extract_text(queries_content[query_fan], args.text_type)
        if not query_text:
            missing_query_fans.append(query_fan)
            continue

        # Prepare document texts and keep track of their FANs
        doc_texts = []
        doc_fans = []
        missing_docs_for_query = []

        for doc_fan in pre_ranked_docs:
            if doc_fan not in documents_content:
                missing_docs_for_query.append(doc_fan)
                continue

            doc_text = extract_text(documents_content[doc_fan], args.text_type)
            if doc_text:
                doc_texts.append(doc_text)
                doc_fans.append(doc_fan)

        # Keep track of missing documents
        if missing_docs_for_query:
            missing_doc_fans[query_fan] = missing_docs_for_query

        # Skip if no valid documents
        if not doc_texts:
            re_ranked[query_fan] = []
            continue

        # Re-rank documents
        print(f"\nRe-ranking {len(doc_texts)} documents for training query {query_fan}")

        # Print some of the original pre-ranking order for debugging
        print(f"Original pre-ranking (first 3): {doc_fans[:3]}")

        # Use the cross-encoder model for reranking
        sorted_indices = cross_encoder_reranking(
            query_text, doc_texts, model, tokenizer,
            batch_size=args.batch_size, max_length=args.max_length
        )
        re_ranked[query_fan] = [doc_fans[i] for i in sorted_indices]

    # Report any missing FANs
    if missing_query_fans:
        print(f"Warning: {len(missing_query_fans)} query FANs were not found in the content data")
    if missing_doc_fans:
        total_missing = sum(len(docs) for docs in missing_doc_fans.values())
        print(f"Warning: {total_missing} document FANs were not found in the content data")

    # Save re-ranked results
    output_path = get_full_path(args.output)
    print(f"Saving re-ranked results to {output_path}...")
    save_json_file(re_ranked, output_path)

    print("Re-ranking complete!")
    print(f"Number of training queries processed: {len(re_ranked)}")

    # Optionally save the missing FANs information for debugging
    if missing_query_fans or missing_doc_fans:
        missing_info = {
            "missing_query_fans": missing_query_fans,
            "missing_doc_fans": missing_doc_fans
        }
        missing_info_path = f"{os.path.splitext(output_path)[0]}_missing_fans.json"
        save_json_file(missing_info, missing_info_path)
        print(f"Information about missing FANs saved to {missing_info_path}")

if __name__ == "__main__":
    main()
datasets/documents_content_with_features.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fcd453767cc0510e362087256cd4655c7d2c911dca351d9837ce76b89bc94ee1
size 68483618
datasets/predictions2.json ADDED
@@ -0,0 +1,642 @@
1
+ {
2
+ "79314580": [
3
+ "90742932",
4
+ "84307890",
5
+ "72116262",
6
+ "72352339",
7
+ "92464359",
8
+ "43243494",
9
+ "99159171",
10
+ "87379239",
11
+ "84339687",
12
+ "43379960",
13
+ "91207030",
14
+ "87321987",
15
+ "93275449",
16
+ "97088252",
17
+ "100886158",
18
+ "107646251",
19
+ "68932616",
20
+ "77860027",
21
+ "31705225",
22
+ "95503744",
23
+ "74928830",
24
+ "43629694",
25
+ "45747223",
26
+ "68476789",
27
+ "71185778",
28
+ "1360767",
29
+ "74251396",
30
+ "1692313",
31
+ "93620723",
32
+ "84816716"
33
+ ],
34
+ "78061231": [
35
+ "82741321",
36
+ "45789888",
37
+ "44264550",
38
+ "87988246",
39
+ "82567288",
40
+ "17816910",
41
+ "44341743",
42
+ "80564076",
43
+ "14904223",
44
+ "61241196",
45
+ "5179001",
46
+ "75763885",
47
+ "45066507",
48
+ "43895830",
49
+ "69551594",
50
+ "87010037",
51
+ "66645142",
52
+ "105616053",
53
+ "90048840",
54
+ "44165603",
55
+ "44396898",
56
+ "5276096",
57
+ "1881065",
58
+ "84776129",
59
+ "5355298",
60
+ "44430674",
61
+ "103861986",
62
+ "93502001",
63
+ "33466561",
64
+ "72681493"
65
+ ],
66
+ "66336898": [
67
+ "66336824",
68
+ "34249922",
69
+ "44193822",
70
+ "806116",
71
+ "74937",
72
+ "62201555",
73
+ "45975516",
74
+ "67250442",
75
+ "71716600",
76
+ "43894634",
77
+ "68442322",
78
+ "75066704",
79
+ "44399553",
80
+ "7881264",
81
+ "7058578",
82
+ "85834586",
83
+ "71288419",
84
+ "76963351",
85
+ "45623535",
86
+ "65451695",
87
+ "75750109",
88
+ "94003570",
89
+ "44468240",
90
+ "32098459",
91
+ "43507576",
92
+ "95717994",
93
+ "82046827",
94
+ "1744101",
95
+ "93736224",
96
+ "77857414"
97
+ ],
98
+ "105235513": [
99
+ "97452767",
100
+ "86303585",
101
+ "73495202",
102
+ "100476640",
103
+ "43628135",
104
+ "6388780",
105
+ "88396353",
106
+ "45557281",
107
+ "1715585",
108
+ "99616608",
109
+ "86007260",
110
+ "103610533",
111
+ "43536913",
112
+ "68577226",
113
+ "62031984",
114
+ "6423024",
115
+ "75036232",
116
+ "89387100",
117
+ "43885294",
118
+ "75948573",
119
+ "91500742",
120
+ "108870888",
121
+ "62287175",
122
+ "60048046",
123
+ "85370361",
124
+ "73104347",
125
+ "36635271",
126
+ "73572124",
127
+ "79756434",
128
+ "68290564"
129
+ ],
130
+ "77017157": [
131
+ "44375254",
132
+ "99285210",
133
+ "87988242",
134
+ "86238540",
135
+ "43403046",
136
+ "44690402",
137
+ "86190175",
138
+ "60008491",
139
+ "43345792",
140
+ "68534353",
141
+ "5358258",
142
+ "84024189",
143
+ "61946161",
144
+ "44863430",
145
+ "61244827",
146
+ "106520041",
147
+ "92017056",
148
+ "90445221",
149
+ "76112334",
150
+ "44359627",
151
+ "7104341",
152
+ "81094405",
153
+ "88274986",
154
+ "43728980",
155
+ "43605816",
156
+ "22858132",
157
+ "71005806",
158
+ "1737723",
159
+ "76191452",
160
+ "109532698"
161
+ ],
162
+ "106232152": [
163
+ "2719507",
164
+ "67854373",
165
+ "72113421",
166
+ "71547271",
167
+ "73350347",
168
+ "67610829",
169
+ "65038124",
170
+ "103770025",
171
+ "44527852",
172
+ "73995375",
173
+ "75319265",
174
+ "6397305",
175
+ "73903522",
176
+ "86342428",
177
+ "67698815",
178
+ "61938387",
179
+ "1923101",
180
+ "43867066",
181
+ "106667710",
182
+ "83331919",
183
+ "64464686",
184
+ "106895832",
185
+ "45808372",
186
+ "8079418",
187
+ "80564043",
188
+ "61119482",
189
+ "84005413",
190
+ "79986739",
191
+ "85472986",
192
+ "43810028"
193
+ ],
194
+ "76353708": [
195
+ "67682754",
196
+ "109084253",
197
+ "71320166",
198
+ "87799884",
199
+ "79249992",
200
+ "43917872",
201
+ "70492735",
202
+ "43668360",
203
+ "68498995",
204
+ "66038043",
205
+ "95781568",
206
+ "87407728",
207
+ "87854266",
208
+ "44406914",
209
+ "45823324",
210
+ "8011022",
211
+ "44295009",
212
+ "68036958",
213
+ "43591520",
214
+ "7670536",
215
+ "88303745",
216
+ "87707840",
217
+ "32931094",
218
+ "6881047",
219
+ "59339688",
220
+ "75731982",
221
+ "44369885",
222
+ "73095247",
223
+ "22421913",
224
+ "60626819"
225
+ ],
226
+ "79741091": [
227
+ "32607171",
228
+ "22769689",
229
+ "32370299",
230
+ "32660850",
231
+ "46019838",
232
+ "72004503",
233
+ "67967480",
234
+ "84813687",
235
+ "33434123",
236
+ "90649732",
237
+ "80783830",
238
+ "7842236",
239
+ "71187327",
240
+ "81276441",
241
+ "66130772",
242
+ "82420795",
243
+ "90056913",
244
+ "77861056",
245
+ "28976620",
246
+ "33300848",
247
+ "92514878",
248
+ "45939997",
249
+ "33374965",
250
+ "86456573",
251
+ "7610722",
252
+ "33531861",
253
+ "13993026",
254
+ "22225138",
255
+ "43614108",
256
+ "52035153"
257
+ ],
258
+ "86860100": [
259
+ "86860100",
260
+ "81692608",
261
+ "73365388",
262
+ "73843030",
263
+ "96849928",
264
+ "61287850",
265
+ "76760721",
266
+ "80708349",
267
+ "81389608",
268
+ "79098866",
269
+ "80775708",
270
+ "45496451",
271
+ "80710567",
272
+ "96208625",
273
+ "76958369",
274
+ "86258263",
275
+ "76279856",
276
+ "73313390",
277
+ "76127196",
278
+ "96379124",
279
+ "66478420",
280
+ "59567985",
281
+ "89793282",
282
+ "92821182",
283
+ "98396291",
284
+ "74805580",
285
+ "45448810",
286
+ "61233401",
287
+ "96681288",
288
+ "88179467"
289
+ ],
290
+ "79045164": [
291
+ "83843480",
292
+ "43156809",
293
+ "93073478",
294
+ "72741864",
295
+ "32999955",
296
+ "80772296",
297
+ "84338721",
298
+ "22415814",
299
+ "69217265",
300
+ "88080895",
301
+ "14870284",
302
+ "32458611",
303
+ "5083848",
304
+ "84605013",
305
+ "29184345",
306
+ "80753707",
307
+ "32511812",
308
+ "31628753",
309
+ "68627132",
310
+ "71610617",
311
+ "61304760",
312
+ "32459382",
313
+ "95191554",
314
+ "21116140",
315
+ "31663374",
316
+ "72450581",
317
+ "21912599",
318
+ "65179742",
319
+ "33051108",
320
+ "33352117"
321
+ ],
322
+ "85176564": [
323
+ "72443837",
324
+ "79000606",
325
+ "77527141",
326
+ "45199654",
327
+ "81692374",
328
+ "73833726",
329
+ "100297053",
330
+ "88192728",
331
+ "87866933",
332
+ "80686648",
333
+ "32416570",
334
+ "79971987",
335
+ "67147862",
336
+ "70911248",
337
+ "80770876",
338
+ "69065347",
339
+ "96680699",
340
+ "7279536",
341
+ "74791199",
342
+ "89963551",
343
+ "77926055",
344
+ "33202790",
345
+ "72169870",
346
+ "45096562",
347
+ "21344722",
348
+ "86043344",
349
+ "90367775",
350
+ "71782269",
351
+ "21927550",
352
+ "7623115"
353
+ ],
354
+ "74598812": [
355
+ "74598812",
356
+ "79540926",
357
+ "67307419",
358
+ "96695878",
359
+ "71716027",
360
+ "44524425",
361
+ "87116173",
362
+ "78616844",
363
+ "72840131",
364
+ "91960790",
365
+ "97024444",
366
+ "44961149",
367
+ "94552652",
368
+ "44520632",
369
+ "95992945",
370
+ "94439793",
371
+ "79335919",
372
+ "87894878",
373
+ "86118184",
374
+ "13982549",
375
+ "85638842",
376
+ "85933200",
377
+ "97867456",
378
+ "99247566",
379
+ "88364610",
380
+ "89362509",
381
+ "77491493",
382
+ "93554574",
383
+ "89131365",
384
+ "4974218"
385
+ ],
386
+ "78286001": [
387
+ "65432068",
388
+ "81715794",
389
+ "72703550",
390
+ "87559850",
391
+ "74552790",
392
+ "90037048",
393
+ "71872009",
394
+ "92628690",
395
+ "86035156",
396
+ "64948806",
397
+ "68821941",
398
+ "15060309",
399
+ "82906678",
400
+ "86335187",
401
+ "68865940",
402
+ "21833368",
403
+ "81622078",
404
+ "7950858",
405
+ "8046245",
406
+ "72460497",
407
+ "83140017",
408
+ "85700905",
409
+ "61290118",
410
+ "66400878",
411
+ "59354384",
412
+ "22675326",
413
+ "85969473",
414
+ "90271766",
415
+ "69569655",
416
+ "86430033"
417
+ ],
418
+ "79098180": [
419
+ "86237859",
420
+ "86696791",
421
+ "86301854",
422
+ "84416764",
423
+ "69328996",
424
+ "81822743",
425
+ "78288530",
426
+ "1376881",
427
+ "70371603",
428
+ "75998807",
429
+ "18158001",
430
+ "86536566",
431
+ "7345574",
432
+ "42810307",
433
+ "44417984",
434
+ "4169775",
435
+ "87437516",
436
+ "60044376",
437
+ "43094034",
438
+ "42365601",
439
+ "66025864",
440
+ "1354473",
441
+ "43878837",
442
+ "42886784",
443
+ "103929967",
444
+ "45901317",
445
+ "68722856",
446
+ "84117280",
447
+ "82465446",
448
+ "4082413"
449
+ ],
450
+ "78090091": [
451
+ "45307524",
452
+ "43533791",
453
+ "44002297",
454
+ "44230813",
455
+ "43830101",
456
+ "44448545",
457
+ "59568578",
458
+ "67700598",
459
+ "69615854",
460
+ "73425457",
461
+ "72990387",
462
+ "43642277",
463
+ "43055440",
464
+ "74340036",
465
+ "67448176",
466
+ "98558258",
467
+ "44302706",
468
+ "94631184",
469
+ "43331251",
470
+ "80640194",
471
+ "15004529",
472
+ "102506777",
473
+ "7602987",
474
+ "106235322",
475
+ "92389110",
476
+ "43912169",
477
+ "61270623",
478
+ "93720691",
479
+ "43473559",
480
+ "1250801"
481
+ ],
482
+ "80155730": [
483
+ "80155730",
484
+ "4166228",
485
+ "44526736",
486
+ "7552728",
487
+ "74507997",
488
+ "66361815",
489
+ "71124588",
490
+ "75237231",
491
+ "44159656",
492
+ "65179536",
493
+ "66361826",
494
+ "67260382",
495
+ "45097472",
496
+ "96208467",
497
+ "81705685",
498
+ "75606608",
499
+ "73532504",
500
+ "67610635",
501
+ "75057499",
502
+ "61899721",
503
+ "79832819",
504
+ "8064990",
505
+ "83115863",
506
+ "87397919",
507
+ "66396259",
508
+ "95909144",
509
+ "92933079",
510
+ "32691430",
511
+ "84161041",
512
+ "78012216"
513
+ ],
514
+ "76109734": [
515
+ "85385104",
516
+ "43765404",
517
+ "77836411",
518
+ "92130365",
519
+ "74594424",
520
+ "106884384",
521
+ "45448491",
522
+ "43710765",
523
+ "95594159",
524
+ "44495876",
525
+ "98208043",
526
+ "87667801",
527
+ "7303675",
528
+ "74580642",
529
+ "87811160",
530
+ "92416430",
531
+ "43804973",
532
+ "91962087",
533
+ "71546239",
534
+ "91897986",
535
+ "67177917",
536
+ "67261744",
537
+ "86218698",
538
+ "88597870",
539
+ "32299928",
540
+ "7018871",
541
+ "84687916",
542
+ "85523256",
543
+ "86020088",
544
+ "89980510"
545
+ ],
546
+ "106318129": [
547
+ "81915722",
548
+ "71772339",
549
+ "87092702",
550
+ "83106147",
551
+ "73657004",
552
+ "43618418",
553
+ "33159460",
554
+ "44586128",
555
+ "75798279",
556
+ "72024688",
557
+ "22135206",
558
+ "43776935",
559
+ "22859958",
560
+ "22666908",
561
+ "83266142",
562
+ "99268783",
563
+ "81302611",
564
+ "98200590",
565
+ "86236123",
566
+ "1772511",
567
+ "93406272",
568
+ "1882557",
569
+ "88406090",
570
+ "84038640",
571
+ "33448504",
572
+ "84797878",
573
+ "6846568",
574
+ "7077099",
575
+ "6706486",
576
+ "5134678"
577
+ ],
578
+ "1864211": [
579
+ "1864211",
580
+ "72442770",
581
+ "8059518",
582
+ "82586094",
583
+ "89411840",
584
+ "14904274",
585
+ "68146754",
586
+ "43490211",
587
+ "1694333",
588
+ "43694603",
589
+ "92166028",
590
+ "72958082",
591
+ "4119507",
592
+ "44553956",
593
+ "96329497",
594
+ "44286898",
595
+ "73417734",
596
+ "87324486",
597
+ "43711365",
598
+ "62074363",
599
+ "71119422",
600
+ "18233366",
601
+ "42841161",
602
+ "43822575",
603
+ "43639144",
604
+ "74893664",
605
+ "43955312",
606
+ "3986038",
607
+ "1613231",
608
+ "21083950"
609
+ ],
610
+ "61277994": [
611
+ "71858872",
612
+ "957307",
613
+ "61989058",
614
+ "44408142",
615
+ "43824316",
616
+ "44346272",
617
+ "61243783",
618
+ "95090315",
619
+ "94883937",
620
+ "62081642",
621
+ "76435367",
622
+ "110878972",
623
+ "42463440",
624
+ "75763896",
625
+ "87721730",
626
+ "6480562",
627
+ "66483344",
628
+ "43318215",
629
+ "7545380",
630
+ "86932488",
631
+ "82532105",
632
+ "44432202",
633
+ "77734762",
634
+ "84184048",
635
+ "82777643",
636
+ "75041109",
637
+ "44355159",
638
+ "76149716",
639
+ "13996136",
640
+ "22195474"
641
+ ]
642
+ }
datasets/queries_content_with_features.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/shuffled_pre_ranking.json ADDED
@@ -0,0 +1,962 @@
1
+ {
2
+ "103964109": [
3
+ "94596291",
4
+ "65451984",
5
+ "81098918",
6
+ "86686331",
7
+ "82807300",
8
+ "74999904",
9
+ "44437432",
10
+ "70494531",
11
+ "89655285",
12
+ "94546339",
13
+ "84923580",
14
+ "110338873",
15
+ "1662314",
16
+ "87285519",
17
+ "112489610",
18
+ "93007218",
19
+ "74364787",
20
+ "93196199",
21
+ "91358966",
22
+ "73189654",
23
+ "91801222",
24
+ "85915967",
25
+ "102035322",
26
+ "96138054",
27
+ "87488738",
28
+ "101974338",
29
+ "104761777",
30
+ "101598636",
31
+ "105078785",
32
+ "92631163"
33
+ ],
34
+ "79314580": [
35
+ "99159171",
36
+ "95503744",
37
+ "77860027",
38
+ "93275449",
39
+ "43379960",
40
+ "100886158",
41
+ "97088252",
42
+ "71185778",
43
+ "72352339",
44
+ "74251396",
45
+ "1692313",
46
+ "87379239",
47
+ "1360767",
48
+ "31705225",
49
+ "92464359",
50
+ "72116262",
51
+ "90742932",
52
+ "84816716",
53
+ "93620723",
54
+ "107646251",
55
+ "84339687",
56
+ "43243494",
57
+ "87321987",
58
+ "84307890",
59
+ "68932616",
60
+ "74928830",
61
+ "91207030",
62
+ "68476789",
63
+ "45747223",
64
+ "43629694"
65
+ ],
66
+ "78061231": [
67
+ "72681493",
68
+ "105616053",
69
+ "66645142",
70
+ "44264550",
71
+ "61241196",
72
+ "69551594",
73
+ "44165603",
74
+ "44341743",
75
+ "87010037",
76
+ "17816910",
77
+ "33466561",
78
+ "75763885",
79
+ "14904223",
80
+ "5355298",
81
+ "82741321",
82
+ "5179001",
83
+ "1881065",
84
+ "93502001",
85
+ "5276096",
86
+ "80564076",
87
+ "43895830",
88
+ "103861986",
89
+ "82567288",
90
+ "90048840",
91
+ "84776129",
92
+ "44430674",
93
+ "45789888",
94
+ "45066507",
95
+ "44396898",
96
+ "87988246"
97
+ ],
98
+ "72214279": [
99
+ "4112132",
100
+ "7126783",
101
+ "66704221",
102
+ "77750612",
103
+ "97541974",
104
+ "1024924",
105
+ "73313951",
106
+ "70443794",
107
+ "66859488",
108
+ "7024647",
109
+ "7186024",
110
+ "7323805",
111
+ "74719750",
112
+ "78285896",
113
+ "76635780",
114
+ "78874284",
115
+ "80491195",
116
+ "66704376",
117
+ "7409445",
118
+ "61941302",
119
+ "95087186",
120
+ "62204608",
121
+ "7185266",
122
+ "95793433",
123
+ "73575071",
124
+ "5188978",
125
+ "86201489",
126
+ "43855508",
127
+ "6786492",
128
+ "1052098"
129
+ ],
130
+ "68249923": [
131
+ "35246469",
132
+ "80619246",
133
+ "7191590",
134
+ "7503954",
135
+ "66204675",
136
+ "96103327",
137
+ "84151582",
138
+ "94133702",
139
+ "22791064",
140
+ "17811615",
141
+ "22538180",
142
+ "44330428",
143
+ "84067048",
144
+ "66785381",
145
+ "89264734",
146
+ "1880329",
147
+ "80790764",
148
+ "35300504",
149
+ "68984248",
150
+ "92458890",
151
+ "89313603",
152
+ "86319625",
153
+ "13900435",
154
+ "4811621",
155
+ "95326577",
156
+ "45407719",
157
+ "4102206",
158
+ "86874453",
159
+ "75001201",
160
+ "43783883"
161
+ ],
162
+ "66336898": [
163
+ "7881264",
164
+ "71288419",
165
+ "44193822",
166
+ "67250442",
167
+ "76963351",
168
+ "82046827",
169
+ "77857414",
170
+ "74937",
171
+ "44468240",
172
+ "45975516",
173
+ "71716600",
174
+ "45623535",
175
+ "85834586",
176
+ "94003570",
177
+ "1744101",
178
+ "806116",
179
+ "7058578",
180
+ "43894634",
181
+ "68442322",
182
+ "62201555",
183
+ "95717994",
184
+ "43507576",
185
+ "75066704",
186
+ "44399553",
187
+ "66336824",
188
+ "65451695",
189
+ "93736224",
190
+ "75750109",
191
+ "32098459",
192
+ "34249922"
193
+ ],
194
+ "105235513": [
195
+ "99616608",
196
+ "6423024",
197
+ "91500742",
198
+ "60048046",
199
+ "1715585",
200
+ "108870888",
201
+ "68577226",
202
+ "43628135",
203
+ "45557281",
204
+ "88396353",
205
+ "97452767",
206
+ "89387100",
207
+ "73495202",
208
+ "73572124",
209
+ "75948573",
210
+ "62031984",
211
+ "43536913",
212
+ "79756434",
213
+ "73104347",
214
+ "43885294",
215
+ "86007260",
216
+ "103610533",
217
+ "86303585",
218
+ "62287175",
219
+ "68290564",
220
+ "75036232",
221
+ "85370361",
222
+ "36635271",
223
+ "6388780",
224
+ "100476640"
225
+ ],
226
+ "79740635": [
227
+ "72282319",
228
+ "80126462",
229
+ "6507475",
230
+ "69278139",
231
+ "75024859",
232
+ "7937000",
233
+ "20176022",
234
+ "77144959",
235
+ "94502562",
236
+ "82301362",
237
+ "64815601",
238
+ "22271872",
239
+ "77324818",
240
+ "69050704",
241
+ "7582109",
242
+ "83374518",
243
+ "64764746",
244
+ "89202146",
245
+ "80781908",
246
+ "32251980",
247
+ "77783945",
248
+ "7332283",
249
+ "97072553",
250
+ "43110403",
251
+ "79096739",
252
+ "13938283",
253
+ "62141731",
254
+ "80463331",
255
+ "89202129",
256
+ "98938772"
257
+ ],
258
+ "100251983": [
259
+ "82028496",
260
+ "89295990",
261
+ "78872192",
262
+ "33473034",
263
+ "100515176",
264
+ "93007808",
265
+ "100251983",
266
+ "90670288",
267
+ "96091957",
268
+ "94669658",
269
+ "80167129",
270
+ "45282121",
271
+ "104436949",
272
+ "97292661",
273
+ "107325691",
274
+ "89672703",
275
+ "82206845",
276
+ "93331965",
277
+ "106996649",
278
+ "90490909",
279
+ "86073872",
280
+ "94029448",
281
+ "88392348",
282
+ "98600374",
283
+ "96059580",
284
+ "72140435",
285
+ "91560811",
286
+ "46598360",
287
+ "78547811",
288
+ "88463494"
289
+ ],
290
+ "77017157": [
291
+ "76191452",
292
+ "44359627",
293
+ "43728980",
294
+ "86238540",
295
+ "84024189",
296
+ "88274986",
297
+ "60008491",
298
+ "61244827",
299
+ "44375254",
300
+ "76112334",
301
+ "43403046",
302
+ "5358258",
303
+ "7104341",
304
+ "43345792",
305
+ "61946161",
306
+ "22858132",
307
+ "90445221",
308
+ "44690402",
309
+ "81094405",
310
+ "71005806",
311
+ "87988242",
312
+ "86190175",
313
+ "1737723",
314
+ "99285210",
315
+ "109532698",
316
+ "92017056",
317
+ "44863430",
318
+ "106520041",
319
+ "43605816",
320
+ "68534353"
321
+ ],
322
+ "106232152": [
323
+ "64464686",
324
+ "43867066",
325
+ "45808372",
326
+ "103770025",
327
+ "72113421",
328
+ "84005413",
329
+ "43810028",
330
+ "73350347",
331
+ "67854373",
332
+ "73995375",
333
+ "75319265",
334
+ "79986739",
335
+ "71547271",
336
+ "44527852",
337
+ "8079418",
338
+ "67610829",
339
+ "61119482",
340
+ "2719507",
341
+ "73903522",
342
+ "106667710",
343
+ "67698815",
344
+ "106895832",
345
+ "80564043",
346
+ "65038124",
347
+ "61938387",
348
+ "6397305",
349
+ "83331919",
350
+ "1923101",
351
+ "85472986",
352
+ "86342428"
353
+ ],
354
+ "76353708": [
355
+ "87707840",
356
+ "87854266",
357
+ "43591520",
358
+ "79249992",
359
+ "95781568",
360
+ "32931094",
361
+ "87407728",
362
+ "70492735",
363
+ "6881047",
364
+ "66038043",
365
+ "73095247",
366
+ "7670536",
367
+ "43917872",
368
+ "44295009",
369
+ "44369885",
370
+ "45823324",
371
+ "71320166",
372
+ "44406914",
373
+ "22421913",
374
+ "88303745",
375
+ "59339688",
376
+ "68036958",
377
+ "60626819",
378
+ "43668360",
379
+ "87799884",
380
+ "67682754",
381
+ "68498995",
382
+ "109084253",
383
+ "75731982",
384
+ "8011022"
385
+ ],
386
+ "79741091": [
387
+ "46019838",
388
+ "71187327",
389
+ "52035153",
390
+ "7842236",
391
+ "22769689",
392
+ "84813687",
393
+ "13993026",
394
+ "82420795",
395
+ "66130772",
396
+ "80783830",
397
+ "72004503",
398
+ "7610722",
399
+ "33374965",
400
+ "43614108",
401
+ "67967480",
402
+ "32660850",
403
+ "32607171",
404
+ "81276441",
405
+ "90056913",
406
+ "28976620",
407
+ "32370299",
408
+ "92514878",
409
+ "86456573",
410
+ "45939997",
411
+ "90649732",
412
+ "22225138",
413
+ "77861056",
414
+ "33300848",
415
+ "33531861",
416
+ "33434123"
417
+ ],
418
+ "86860100": [
419
+ "98396291",
420
+ "89793282",
421
+ "80708349",
422
+ "96681288",
423
+ "76279856",
424
+ "76127196",
425
+ "86258263",
426
+ "79098866",
427
+ "76958369",
428
+ "80775708",
429
+ "61233401",
430
+ "80710567",
431
+ "73365388",
432
+ "66478420",
433
+ "74805580",
434
+ "76760721",
435
+ "96208625",
436
+ "73313390",
437
+ "81692608",
438
+ "92821182",
439
+ "86860100",
440
+ "96849928",
441
+ "88179467",
442
+ "59567985",
443
+ "45448810",
444
+ "61287850",
445
+ "73843030",
446
+ "81389608",
447
+ "96379124",
448
+ "45496451"
449
+ ],
450
+ "79045164": [
451
+ "21116140",
452
+ "29184345",
453
+ "33352117",
454
+ "80753707",
455
+ "65179742",
456
+ "68627132",
457
+ "32999955",
458
+ "22415814",
459
+ "71610617",
460
+ "83843480",
461
+ "33051108",
462
+ "31628753",
463
+ "32511812",
464
+ "93073478",
465
+ "32459382",
466
+ "5083848",
467
+ "43156809",
468
+ "88080895",
469
+ "14870284",
470
+ "21912599",
471
+ "69217265",
472
+ "32458611",
473
+ "61304760",
474
+ "80772296",
475
+ "72450581",
476
+ "31663374",
477
+ "95191554",
478
+ "84605013",
479
+ "84338721",
480
+ "72741864"
481
+ ],
482
+ "85176564": [
483
+ "45096562",
484
+ "32416570",
485
+ "67147862",
486
+ "21344722",
487
+ "73833726",
488
+ "79000606",
489
+ "70911248",
490
+ "100297053",
491
+ "71782269",
492
+ "72443837",
493
+ "87866933",
494
+ "69065347",
495
+ "21927550",
496
+ "86043344",
497
+ "80686648",
498
+ "89963551",
499
+ "7279536",
500
+ "90367775",
501
+ "80770876",
502
+ "81692374",
503
+ "72169870",
504
+ "33202790",
505
+ "74791199",
506
+ "96680699",
507
+ "45199654",
508
+ "77527141",
509
+ "88192728",
510
+ "77926055",
511
+ "7623115",
512
+ "79971987"
513
+ ],
514
+ "74598812": [
515
+ "78616844",
516
+ "44961149",
517
+ "91960790",
518
+ "79540926",
519
+ "72840131",
520
+ "99247566",
521
+ "94552652",
522
+ "89362509",
523
+ "79335919",
524
+ "71716027",
525
+ "95992945",
526
+ "86118184",
527
+ "74598812",
528
+ "87116173",
529
+ "93554574",
530
+ "13982549",
531
+ "88364610",
532
+ "97024444",
533
+ "44520632",
534
+ "4974218",
535
+ "87894878",
536
+ "85933200",
537
+ "97867456",
538
+ "85638842",
539
+ "89131365",
540
+ "67307419",
541
+ "44524425",
542
+ "77491493",
543
+ "94439793",
544
+ "96695878"
545
+ ],
546
+ "76109416": [
547
+ "67929958",
548
+ "1212129",
549
+ "74871145",
550
+ "92580522",
551
+ "44149931",
552
+ "96952506",
553
+ "75226918",
554
+ "76109416",
555
+ "87097786",
556
+ "90897974",
557
+ "89081482",
558
+ "43237025",
559
+ "43845712",
560
+ "44268155",
561
+ "72024697",
562
+ "68498126",
563
+ "85956111",
564
+ "86120389",
565
+ "86772879",
566
+ "71010602",
567
+ "6825461",
568
+ "66358",
569
+ "86586319",
570
+ "45066980",
571
+ "79148717",
572
+ "1859328",
573
+ "83775044",
574
+ "98417973",
575
+ "83915124",
576
+ "84518942"
577
+ ],
578
+ "78286001": [
579
+ "86430033",
580
+ "69569655",
581
+ "87559850",
582
+ "90037048",
583
+ "7950858",
584
+ "59354384",
585
+ "8046245",
586
+ "68821941",
587
+ "65432068",
588
+ "81715794",
589
+ "66400878",
590
+ "86035156",
591
+ "86335187",
592
+ "64948806",
593
+ "15060309",
594
+ "68865940",
595
+ "90271766",
596
+ "72460497",
597
+ "21833368",
598
+ "82906678",
599
+ "74552790",
600
+ "92628690",
601
+ "72703550",
602
+ "85969473",
603
+ "81622078",
604
+ "71872009",
605
+ "83140017",
606
+ "85700905",
607
+ "22675326",
608
+ "61290118"
609
+ ],
610
+ "79098180": [
611
+ "84117280",
612
+ "86237859",
613
+ "44417984",
614
+ "43094034",
615
+ "45901317",
616
+ "42886784",
617
+ "7345574",
618
+ "43878837",
619
+ "1354473",
620
+ "86301854",
621
+ "66025864",
622
+ "86696791",
623
+ "42365601",
624
+ "86536566",
625
+ "18158001",
626
+ "70371603",
627
+ "1376881",
628
+ "81822743",
629
+ "4082413",
630
+ "4169775",
631
+ "69328996",
632
+ "68722856",
633
+ "60044376",
634
+ "84416764",
635
+ "75998807",
636
+ "42810307",
637
+ "78288530",
638
+ "103929967",
639
+ "82465446",
640
+ "87437516"
641
+ ],
642
+ "85685768": [
643
+ "5399125",
644
+ "103242931",
645
+ "96992323",
646
+ "79386554",
647
+ "91467629",
648
+ "101540987",
649
+ "94616956",
650
+ "101485622",
651
+ "43287174",
652
+ "90415246",
653
+ "91796062",
654
+ "43124738",
655
+ "92906077",
656
+ "86711677",
657
+ "91034639",
658
+ "68036945",
659
+ "100252809",
660
+ "6716158",
661
+ "22854327",
662
+ "91230085",
663
+ "44105210",
664
+ "80775193",
665
+ "92933013",
666
+ "7874471",
667
+ "93463606",
668
+ "98140923",
669
+ "87659292",
670
+ "71483822",
671
+ "92322829",
672
+ "22926275"
673
+ ],
674
+ "78090091": [
675
+ "72990387",
676
+ "106235322",
677
+ "44002297",
678
+ "43533791",
679
+ "1250801",
680
+ "93720691",
681
+ "43473559",
682
+ "44448545",
683
+ "94631184",
684
+ "74340036",
685
+ "43912169",
686
+ "15004529",
687
+ "44230813",
688
+ "43830101",
689
+ "43642277",
690
+ "92389110",
691
+ "44302706",
692
+ "59568578",
693
+ "67448176",
694
+ "61270623",
695
+ "45307524",
696
+ "7602987",
697
+ "69615854",
698
+ "43331251",
699
+ "102506777",
700
+ "98558258",
701
+ "43055440",
702
+ "73425457",
703
+ "67700598",
704
+ "80640194"
705
+ ],
706
+ "80155730": [
707
+ "67610635",
708
+ "92933079",
709
+ "75057499",
710
+ "45097472",
711
+ "87397919",
712
+ "7552728",
713
+ "8064990",
714
+ "83115863",
715
+ "74507997",
716
+ "96208467",
717
+ "79832819",
718
+ "66361826",
719
+ "71124588",
720
+ "78012216",
721
+ "75237231",
722
+ "65179536",
723
+ "75606608",
724
+ "80155730",
725
+ "81705685",
726
+ "44159656",
727
+ "66361815",
728
+ "67260382",
729
+ "61899721",
730
+ "32691430",
731
+ "84161041",
732
+ "95909144",
733
+ "73532504",
734
+ "66396259",
735
+ "44526736",
736
+ "4166228"
737
+ ],
738
+ "70563808": [
739
+ "76578541",
740
+ "66791385",
741
+ "44519425",
742
+ "13902208",
743
+ "75429864",
744
+ "59356913",
745
+ "45525493",
746
+ "14687460",
747
+ "7305297",
748
+ "87318046",
749
+ "72133315",
750
+ "62199485",
751
+ "7445616",
752
+ "105795426",
753
+ "7728171",
754
+ "61287856",
755
+ "6383581",
756
+ "6936022",
757
+ "43891147",
758
+ "7706134",
759
+ "70563808",
760
+ "1651911",
761
+ "77540953",
762
+ "7807387",
763
+ "46288752",
764
+ "78164659",
765
+ "7949497",
766
+ "45039835",
767
+ "45760699",
768
+ "8125158"
769
+ ],
770
+ "79482665": [
771
+ "87571623",
772
+ "92503480",
773
+ "68914227",
774
+ "81582469",
775
+ "102635100",
776
+ "88449039",
777
+ "67370804",
778
+ "73293133",
779
+ "95908606",
780
+ "93592423",
781
+ "6903999",
782
+ "71022105",
783
+ "87300637",
784
+ "72279718",
785
+ "73497949",
786
+ "88456909",
787
+ "87107633",
788
+ "93318781",
789
+ "86402797",
790
+ "93524596",
791
+ "46019842",
792
+ "44666167",
793
+ "110233006",
794
+ "103621196",
795
+ "84540103",
796
+ "93279107",
797
+ "70930284",
798
+ "89030165",
799
+ "4095476",
800
+ "83083320"
801
+ ],
802
+ "76109734": [
803
+ "71546239",
804
+ "67177917",
805
+ "106884384",
806
+ "95594159",
807
+ "43804973",
808
+ "85385104",
809
+ "98208043",
810
+ "92416430",
811
+ "43765404",
812
+ "7303675",
813
+ "74580642",
814
+ "74594424",
815
+ "88597870",
816
+ "92130365",
817
+ "45448491",
818
+ "32299928",
819
+ "85523256",
820
+ "87667801",
821
+ "91897986",
822
+ "86218698",
823
+ "87811160",
824
+ "43710765",
825
+ "86020088",
826
+ "67261744",
827
+ "91962087",
828
+ "89980510",
829
+ "44495876",
830
+ "77836411",
831
+ "7018871",
832
+ "84687916"
833
+ ],
834
+ "106318129": [
835
+ "81915722",
836
+ "22135206",
837
+ "88406090",
838
+ "86236123",
839
+ "33159460",
840
+ "6706486",
841
+ "7077099",
842
+ "73657004",
843
+ "71772339",
844
+ "6846568",
845
+ "83266142",
846
+ "43618418",
847
+ "83106147",
848
+ "33448504",
849
+ "99268783",
850
+ "1772511",
851
+ "87092702",
852
+ "22666908",
853
+ "72024688",
854
+ "1882557",
855
+ "98200590",
856
+ "43776935",
857
+ "5134678",
858
+ "75798279",
859
+ "22859958",
860
+ "93406272",
861
+ "81302611",
862
+ "84038640",
863
+ "84797878",
864
+ "44586128"
865
+ ],
866
+ "1864211": [
867
+ "74893664",
868
+ "14904274",
869
+ "73417734",
870
+ "43490211",
871
+ "82586094",
872
+ "18233366",
873
+ "1694333",
874
+ "1613231",
875
+ "8059518",
876
+ "62074363",
877
+ "4119507",
878
+ "44286898",
879
+ "43822575",
880
+ "71119422",
881
+ "43694603",
882
+ "87324486",
883
+ "3986038",
884
+ "1864211",
885
+ "68146754",
886
+ "43711365",
887
+ "72958082",
888
+ "44553956",
889
+ "89411840",
890
+ "21083950",
891
+ "92166028",
892
+ "96329497",
893
+ "43955312",
894
+ "43639144",
895
+ "42841161",
896
+ "72442770"
897
+ ],
898
+ "75800075": [
899
+ "33464411",
900
+ "34284570",
901
+ "74966633",
902
+ "71238892",
903
+ "84214328",
904
+ "73305870",
905
+ "77197418",
906
+ "64972313",
907
+ "93085483",
908
+ "61870386",
909
+ "73750287",
910
+ "35215942",
911
+ "75692075",
912
+ "77144269",
913
+ "35197610",
914
+ "43878177",
915
+ "76120190",
916
+ "81692381",
917
+ "43687538",
918
+ "62288211",
919
+ "70999237",
920
+ "76825949",
921
+ "7588356",
922
+ "34173412",
923
+ "43796291",
924
+ "62194904",
925
+ "81704710",
926
+ "22823110",
927
+ "7228433",
928
+ "86183849"
929
+ ],
930
+ "61277994": [
931
+ "13996136",
932
+ "87721730",
933
+ "43824316",
934
+ "82532105",
935
+ "44408142",
936
+ "61989058",
937
+ "62081642",
938
+ "82777643",
939
+ "86932488",
940
+ "44355159",
941
+ "957307",
942
+ "43318215",
943
+ "61243783",
944
+ "110878972",
945
+ "66483344",
946
+ "76149716",
947
+ "22195474",
948
+ "95090315",
949
+ "76435367",
950
+ "44432202",
951
+ "6480562",
952
+ "84184048",
953
+ "7545380",
954
+ "44346272",
955
+ "75763896",
956
+ "94883937",
957
+ "75041109",
958
+ "42463440",
959
+ "71858872",
960
+ "77734762"
961
+ ]
962
+ }
datasets/test_queries.json ADDED
@@ -0,0 +1,12 @@
[
  "76109416",
  "75800075",
  "68249923",
  "79482665",
  "79740635",
  "100251983",
  "70563808",
  "103964109",
  "72214279",
  "85685768"
]
datasets/train_gold_mapping.json ADDED
@@ -0,0 +1,111 @@
1
+ {
2
+ "79314580": [
3
+ "43379960",
4
+ "68932616",
5
+ "43243494"
6
+ ],
7
+ "78061231": [
8
+ "44396898",
9
+ "45789888",
10
+ "44165603",
11
+ "44264550"
12
+ ],
13
+ "66336898": [
14
+ "32098459",
15
+ "44468240",
16
+ "74937",
17
+ "62201555",
18
+ "44193822",
19
+ "806116"
20
+ ],
21
+ "105235513": [
22
+ "85370361"
23
+ ],
24
+ "77017157": [
25
+ "61244827",
26
+ "71005806",
27
+ "61946161"
28
+ ],
29
+ "106232152": [
30
+ "84005413",
31
+ "73995375",
32
+ "45808372",
33
+ "86342428"
34
+ ],
35
+ "76353708": [
36
+ "71320166",
37
+ "59339688",
38
+ "44295009",
39
+ "22421913",
40
+ "32931094",
41
+ "60626819",
42
+ "44369885"
43
+ ],
44
+ "79741091": [
45
+ "33434123"
46
+ ],
47
+ "86860100": [
48
+ "66478420",
49
+ "76958369"
50
+ ],
51
+ "79045164": [
52
+ "31628753",
53
+ "72450581",
54
+ "5083848",
55
+ "33352117",
56
+ "72741864"
57
+ ],
58
+ "85176564": [
59
+ "69065347",
60
+ "70911248"
61
+ ],
62
+ "74598812": [
63
+ "72840131"
64
+ ],
65
+ "78286001": [
66
+ "72460497",
67
+ "68865940",
68
+ "68821941"
69
+ ],
70
+ "79098180": [
71
+ "1376881",
72
+ "68722856",
73
+ "7345574"
74
+ ],
75
+ "78090091": [
76
+ "43331251",
77
+ "72990387",
78
+ "44002297",
79
+ "43473559",
80
+ "44230813"
81
+ ],
82
+ "80155730": [
83
+ "67610635",
84
+ "66361815"
85
+ ],
86
+ "76109734": [
87
+ "67177917",
88
+ "67261744",
89
+ "43710765",
90
+ "32299928",
91
+ "43804973"
92
+ ],
93
+ "106318129": [
94
+ "88406090",
95
+ "22859958",
96
+ "44586128"
97
+ ],
98
+ "1864211": [
99
+ "44553956",
100
+ "14904274",
101
+ "21083950",
102
+ "1613231",
103
+ "43490211",
104
+ "43955312"
105
+ ],
106
+ "61277994": [
107
+ "44355159",
108
+ "44432202",
109
+ "957307"
110
+ ]
111
+ }
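Because the re-ranking task only reorders each query's candidate list, every gold document for a training query should already appear among that query's 30 pre-ranked candidates. A quick sanity check along these lines can confirm it (file names taken from the README; this is a sketch, not part of the provided scripts):

```python
import json

with open("datasets/train_gold_mapping.json") as f:
    gold = json.load(f)
with open("datasets/shuffled_pre_ranking.json") as f:
    pre_ranking = json.load(f)

# Report any gold document that is missing from its query's candidate pool.
for qid, gold_docs in gold.items():
    candidates = set(pre_ranking.get(qid, []))
    missing = [d for d in gold_docs if d not in candidates]
    if missing:
        print(f"query {qid}: gold docs not in pre-ranking: {missing}")
print("check complete")
```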
datasets/train_queries.json ADDED
@@ -0,0 +1,22 @@
+ [
+ "79098180",
+ "79045164",
+ "106232152",
+ "106318129",
+ "80155730",
+ "105235513",
+ "66336898",
+ "79741091",
+ "76353708",
+ "85176564",
+ "77017157",
+ "76109734",
+ "61277994",
+ "78090091",
+ "74598812",
+ "1864211",
+ "79314580",
+ "86860100",
+ "78286001",
+ "78061231"
+ ]
evaluate_train_rankings.py ADDED
@@ -0,0 +1,101 @@
+ import os
+ import json
+ import argparse
+ import sys
+
+ # Import metrics directly from the local file
+ from metrics import mean_recall_at_k, mean_average_precision, mean_inv_ranking, mean_ranking
+
+ def load_json_file(file_path):
+     """Load JSON data from a file"""
+     with open(file_path, 'r') as f:
+         return json.load(f)
+
+ def main():
+     parser = argparse.ArgumentParser(description='Evaluate document ranking performance on training data')
+     parser.add_argument('--pre_ranking', type=str, default='shuffled_pre_ranking.json',
+                         help='Path to pre-ranking JSON file')
+     parser.add_argument('--re_ranking', type=str, default='predictions2.json',
+                         help='Path to re-ranked JSON file')
+     parser.add_argument('--gold', type=str, default='train_gold_mapping.json',
+                         help='Path to gold standard mapping JSON file (training only)')
+     parser.add_argument('--train_queries', type=str, default='train_queries.json',
+                         help='Path to training queries JSON file')
+     parser.add_argument('--k_values', type=str, default='3,5,10,20',
+                         help='Comma-separated list of k values for Recall@k')
+     parser.add_argument('--base_dir', type=str,
+                         default='datasets',
+                         help='Base directory for data files')
+     args = parser.parse_args()
+
+     # Ensure all paths are relative to base_dir if they're not absolute
+     def get_full_path(path):
+         if os.path.isabs(path):
+             return path
+         return os.path.join(args.base_dir, path)
+
+     # Load the training queries
+     print("Loading training queries...")
+     train_queries = load_json_file(get_full_path(args.train_queries))
+     print(f"Loaded {len(train_queries)} training queries")
+
+     # Load the ranking data and gold standard
+     print("Loading ranking data and gold standard...")
+     pre_ranking = load_json_file(get_full_path(args.pre_ranking))
+     re_ranking = load_json_file(get_full_path(args.re_ranking))
+     gold_mapping = load_json_file(get_full_path(args.gold))
+
+     # Filter to include only training queries
+     pre_ranking = {fan: docs for fan, docs in pre_ranking.items() if fan in train_queries}
+     re_ranking = {fan: docs for fan, docs in re_ranking.items() if fan in train_queries}
+     gold_mapping = {fan: docs for fan, docs in gold_mapping.items() if fan in train_queries}
+
+     # Parse k values
+     k_values = [int(k) for k in args.k_values.split(',')]
+
+     # Prepare data for metrics calculation
+     query_fans = set(gold_mapping.keys()) & set(pre_ranking.keys()) & set(re_ranking.keys())
+
+     if not query_fans:
+         print("Error: No common query FANs found across all datasets!")
+         return
+
+     print(f"Evaluating rankings for {len(query_fans)} training queries...")
+
+     # Extract true and predicted labels for both rankings
+     true_labels = [gold_mapping[fan] for fan in query_fans]
+     pre_ranking_labels = [pre_ranking[fan] for fan in query_fans]
+     re_ranking_labels = [re_ranking[fan] for fan in query_fans]
+
+     # Calculate metrics for pre-ranking
+     print("\nPre-ranking performance (training queries only):")
+     for k in k_values:
+         recall_at_k = mean_recall_at_k(true_labels, pre_ranking_labels, k=k)
+         print(f" Recall@{k}: {recall_at_k:.4f}")
+
+     map_score = mean_average_precision(true_labels, pre_ranking_labels)
+     print(f" MAP: {map_score:.4f}")
+
+     inv_rank = mean_inv_ranking(true_labels, pre_ranking_labels)
+     print(f" Mean Inverse Rank: {inv_rank:.4f}")
+
+     rank = mean_ranking(true_labels, pre_ranking_labels)
+     print(f" Mean Rank: {rank:.2f}")
+
+     # Calculate metrics for re-ranking
+     print("\nRe-ranking performance (training queries only):")
+     for k in k_values:
+         recall_at_k = mean_recall_at_k(true_labels, re_ranking_labels, k=k)
+         print(f" Recall@{k}: {recall_at_k:.4f}")
+
+     map_score = mean_average_precision(true_labels, re_ranking_labels)
+     print(f" MAP: {map_score:.4f}")
+
+     inv_rank = mean_inv_ranking(true_labels, re_ranking_labels)
+     print(f" Mean Inverse Rank: {inv_rank:.4f}")
+
+     rank = mean_ranking(true_labels, re_ranking_labels)
+     print(f" Mean Rank: {rank:.2f}")
+
+ if __name__ == "__main__":
+     main()
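The evaluator above expects the re-ranking to be a JSON object with the same shape as the pre-ranking: each training query ID mapped to its candidate documents, now in the order your model prefers. By default it reads `predictions2.json` from `--base_dir` (default `datasets`), so a typical run is `python evaluate_train_rankings.py --re_ranking my_rerank.json`. As an illustrative baseline (not part of the provided scripts), the sketch below writes an "identity" re-ranking that simply copies the pre-ranking order for the training queries; the output file name is hypothetical:

```python
import json
import os

base_dir = "datasets"  # matches the script's default --base_dir

with open(os.path.join(base_dir, "shuffled_pre_ranking.json")) as f:
    pre_ranking = json.load(f)
with open(os.path.join(base_dir, "train_queries.json")) as f:
    train_queries = json.load(f)

# Identity baseline: keep the shuffled order unchanged for each training query.
identity_ranking = {qid: pre_ranking[qid] for qid in train_queries}

with open(os.path.join(base_dir, "identity_ranking.json"), "w") as f:
    json.dump(identity_ranking, f, indent=2)

# Evaluate with: python evaluate_train_rankings.py --re_ranking identity_ranking.json
```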
metrics.py ADDED
@@ -0,0 +1,207 @@
+ """
+ Evaluation metrics for document ranking.
+ This file contains implementation of various evaluation metrics
+ for assessing the quality of document rankings.
+ """
+ import numpy as np
+
+ def recall_at_k(true_items, predicted_items, k=10):
+     """
+     Calculate recall at k for a single query.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+         k (int): Number of top items to consider
+
+     Returns:
+         float: Recall@k value between 0 and 1
+     """
+     if not true_items:
+         return 0.0  # No relevant items to recall
+
+     # Get the top k predicted items
+     top_k_items = predicted_items[:k]
+
+     # Count the number of true items in the top k predictions
+     relevant_in_top_k = sum(1 for item in top_k_items if item in true_items)
+
+     # Calculate recall: (relevant items in top k) / (total relevant items)
+     return relevant_in_top_k / len(true_items)
+
+ def mean_recall_at_k(true_items_list, predicted_items_list, k=10):
+     """
+     Calculate mean recall at k across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+         k (int): Number of top items to consider
+
+     Returns:
+         float: Mean Recall@k value between 0 and 1
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return 0.0  # No data provided
+
+     # Calculate recall@k for each query
+     recalls = [recall_at_k(true_items, predicted_items, k)
+                for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Return mean recall@k
+     return sum(recalls) / len(recalls)
+
+ def average_precision(true_items, predicted_items):
+     """
+     Calculate average precision for a single query.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+
+     Returns:
+         float: Average precision value between 0 and 1
+     """
+     if not true_items or not predicted_items:
+         return 0.0
+
+     # Track number of relevant items seen and running sum of precision values
+     relevant_count = 0
+     precision_sum = 0.0
+
+     # Calculate precision at each position where a relevant item is found
+     for i, item in enumerate(predicted_items):
+         position = i + 1  # 1-indexed position
+
+         if item in true_items:
+             relevant_count += 1
+             # Precision at this position = relevant items seen / position
+             precision_at_position = relevant_count / position
+             precision_sum += precision_at_position
+
+     # Average precision = sum of precision values / total relevant items
+     total_relevant = len(true_items)
+     return precision_sum / total_relevant if total_relevant > 0 else 0.0
+
+ def mean_average_precision(true_items_list, predicted_items_list):
+     """
+     Calculate mean average precision (MAP) across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+
+     Returns:
+         float: MAP value between 0 and 1
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return 0.0  # No data provided
+
+     # Calculate average precision for each query
+     aps = [average_precision(true_items, predicted_items)
+            for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Return mean average precision
+     return sum(aps) / len(aps)
+
+ def inverse_ranking(true_items, predicted_items):
+     """
+     Calculate inverse ranking for the first relevant item.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+
+     Returns:
+         float: Inverse ranking value between 0 and 1
+     """
+     if not true_items or not predicted_items:
+         return 0.0
+
+     # Find position of first relevant item (1-indexed)
+     for i, item in enumerate(predicted_items):
+         if item in true_items:
+             rank = i + 1
+             return 1.0 / rank  # Inverse ranking
+
+     # No relevant items found in predictions
+     return 0.0
+
+ def mean_inv_ranking(true_items_list, predicted_items_list):
+     """
+     Calculate mean inverse ranking (MIR) across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+
+     Returns:
+         float: MIR value between 0 and 1
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return 0.0  # No data provided
+
+     # Calculate inverse ranking for each query
+     inv_ranks = [inverse_ranking(true_items, predicted_items)
+                  for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Return mean inverse ranking
+     return sum(inv_ranks) / len(inv_ranks)
+
+ def ranking(true_items, predicted_items):
+     """
+     Calculate the rank of the first relevant item.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+
+     Returns:
+         float: Rank of the first relevant item (1-indexed)
+     """
+     if not true_items or not predicted_items:
+         return float('inf')  # No relevant items to find
+
+     # Find position of first relevant item (1-indexed)
+     for i, item in enumerate(predicted_items):
+         if item in true_items:
+             return i + 1  # Return rank (1-indexed)
+
+     # No relevant items found in predictions
+     return float('inf')
+
+ def mean_ranking(true_items_list, predicted_items_list):
+     """
+     Calculate mean ranking across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+
+     Returns:
+         float: Mean ranking value (higher is worse)
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return float('inf')  # No data provided
+
+     # Calculate ranking for each query
+     ranks = [ranking(true_items, predicted_items)
+              for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Filter out 'inf' values for mean calculation
+     finite_ranks = [r for r in ranks if r != float('inf')]
+
+     # Return mean ranking
+     return sum(finite_ranks) / len(finite_ranks) if finite_ranks else float('inf')
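The metric functions above operate on plain Python lists, so they are easy to test in isolation. A small worked example, with hand-checked expected values in the comments:

```python
from metrics import (mean_recall_at_k, mean_average_precision,
                     mean_inv_ranking, mean_ranking)

# Two toy queries with their gold documents and predicted rankings.
true_items_list = [["d1", "d3"], ["d9"]]
predicted_items_list = [["d3", "d2", "d1"], ["d7", "d8", "d9"]]

print(mean_recall_at_k(true_items_list, predicted_items_list, k=2))   # 0.25
print(mean_average_precision(true_items_list, predicted_items_list))  # ~0.5833
print(mean_inv_ranking(true_items_list, predicted_items_list))        # ~0.6667
print(mean_ranking(true_items_list, predicted_items_list))            # 2.0
```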
requirements.txt ADDED
@@ -0,0 +1,73 @@
+ appnope==0.1.4
+ asttokens==3.0.0
+ certifi==2025.1.31
+ charset-normalizer==3.4.1
+ comm==0.2.2
+ contourpy==1.3.1
+ cycler==0.12.1
+ debugpy==1.8.13
+ decorator==5.2.1
+ executing==2.2.0
+ filelock==3.18.0
+ fonttools==4.57.0
+ fsspec==2025.3.2
+ huggingface-hub==0.30.1
+ idna==3.10
+ ipykernel==6.29.5
+ ipython==9.0.2
+ ipython_pygments_lexers==1.1.1
+ jedi==0.19.2
+ Jinja2==3.1.6
+ joblib==1.4.2
+ jupyter_client==8.6.3
+ jupyter_core==5.7.2
+ kiwisolver==1.4.8
+ llvmlite==0.44.0
+ MarkupSafe==3.0.2
+ matplotlib==3.10.1
+ matplotlib-inline==0.1.7
+ mpmath==1.3.0
+ nest-asyncio==1.6.0
+ networkx==3.4.2
+ numba==0.61.0
+ numpy==2.1.0
+ packaging==24.2
+ pandas==2.2.3
+ parso==0.8.4
+ pexpect==4.9.0
+ pillow==11.1.0
+ platformdirs==4.3.7
+ prompt_toolkit==3.0.50
+ psutil==7.0.0
+ ptyprocess==0.7.0
+ pure_eval==0.2.3
+ Pygments==2.19.1
+ pynndescent==0.5.13
+ pyparsing==3.2.3
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ PyYAML==6.0.2
+ pyzmq==26.4.0
+ regex==2024.11.6
+ requests==2.32.3
+ safetensors==0.5.3
+ scikit-learn==1.6.1
+ scipy==1.15.2
+ seaborn==0.13.2
+ sentence-transformers==4.0.2
+ setuptools==78.1.0
+ six==1.17.0
+ stack-data==0.6.3
+ sympy==1.13.1
+ threadpoolctl==3.6.0
+ tokenizers==0.21.1
+ torch==2.6.0
+ tornado==6.4.2
+ tqdm==4.67.1
+ traitlets==5.14.3
+ transformers==4.50.3
+ typing_extensions==4.13.1
+ tzdata==2025.2
+ umap-learn==0.5.7
+ urllib3==2.3.0
+ wcwidth==0.2.13
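These pinned versions can be installed into a fresh environment with `pip install -r requirements.txt`; on Google Colab, the environment suggested in the README, many of them are already present, so installing only what is missing (for example `sentence-transformers`) is often enough.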