darpanaswal committed · verified
Commit 3b98ef0 · 1 Parent(s): c875dd8

Upload 12 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ datasets/documents_content_with_features.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,11 +1,128 @@
- ---
- title: Patent Retrieval
- emoji: 📊
- colorFrom: red
- colorTo: yellow
- sdk: static
- pinned: false
- short_description: Space for Information Retrieval Contest
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# PATENT-MATCH PREDICTION 2024-2025 - Task 2: Re-ranking

This directory contains the training data and scripts for the second task of the PATENT-MATCH PREDICTION challenge, which focuses on re-ranking patent documents by their relevance to query patents.

## Directory Contents

### Data Files:
- `train_queries.json` - List of query patent IDs for training
- `test_queries.json` - List of query patent IDs for testing (gold mappings are not accessible during the challenge)
- `train_gold_mapping.json` - Gold-standard mappings of relevant documents for each training query
- `shuffled_pre_ranking.json` - Initial random ranking of documents for each query
- `queries_content_with_features.json` - Content of the query patents with LLM-extracted features
- `documents_content_with_features.json` - Content of the candidate documents with LLM-extracted features

Both content files include an additional key called `features` inside each patent's `Content` entry. These are LLM-extracted features from the claim set that encapsulate the main features of the invention. You can use these features alone or in combination with other parts of the patent to boost your re-ranking performance.
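
For orientation, here is a minimal sketch of how a content file can be loaded and a patent's `features` text pulled out. It assumes the record layout used by the provided scripts (a list of entries with `FAN` and `Content` keys); adjust the path to wherever you keep the datasets.

```python
import json

# Minimal sketch: load one content file and inspect a single patent.
with open("datasets/documents_content_with_features.json") as f:
    records = json.load(f)

# Map each patent ID (FAN) to its Content dictionary, as the provided scripts do.
content_by_fan = {rec["FAN"]: rec["Content"] for rec in records}

fan, content = next(iter(content_by_fan.items()))
title = content.get("title", "")
abstract = content.get("pa01", "")  # the abstract is stored under the first paragraph key
features = " ".join(content.get("features", {}).values())  # LLM-extracted claim features

print(fan, "|", title)
print(features[:300])
```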

### Scripts:
- `cross_encoder_reranking_train.py` - Script for re-ranking documents using cross-encoder models (training data only)
- `evaluate_train_rankings.py` - Evaluation script to measure ranking performance (training data only)
- `metrics.py` - Implementation of the evaluation metrics (Recall@k, MAP, etc.)

## Task Overview

This re-ranking task is more approachable than task 1 because:
- We already provide a pre-ranking of 30 candidate patents for each query
- Your document corpus for each query is therefore limited to just 30 documents (compared to thousands in task 1 or millions in real-life scenarios)

The challenge uses 20 queries for training (with gold mappings provided) and 10 queries for testing (gold mappings not provided during the challenge).

## Objectives

1. Apply the dense retrieval methods learned in task 1 to obtain a decent baseline
2. Develop custom embedding, scoring, and evaluation scripts
3. Experiment with different text representations (including the provided features)
4. Implement and compare various ranking methodologies

All the necessary processing can be done on Google Colab using the free GPUs provided, making this task accessible to everyone.

## Embedding Models

You can find various embedding models on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across 56 datasets spanning 8 embedding tasks, and gives a good picture of how models perform on retrieval and other semantic tasks.

On Hugging Face, you'll find:
- Pre-trained embedding models ready for use
- Model cards with performance metrics and usage instructions
- Example code snippets to help you use these models
- Community discussions about model performance and applications

Some top-performing models to consider include the E5 variants, BGE models, and various Sentence-Transformers models. You can easily load them with the Hugging Face Transformers or Sentence-Transformers libraries.
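
As a concrete starting point, a simple bi-encoder baseline can be put together with the `sentence-transformers` library (which `cross_encoder_reranking_train.py` already depends on). The snippet below is only an illustration: it embeds a query and its candidate documents and sorts the candidates by cosine similarity; the texts are placeholders you would replace with output from `extract_text`.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative bi-encoder baseline: embed query and candidates, rank by cosine similarity.
model = SentenceTransformer("intfloat/e5-large-v2")  # any model from the MTEB leaderboard

# Placeholder texts; in practice these come from the query patent and its 30 pre-ranked
# candidates. Note that E5-style models expect "query: " / "passage: " prefixes;
# check the model card of whichever model you pick.
query_text = "query: <title and abstract of the query patent>"
candidate_texts = ["passage: <candidate document 1>", "passage: <candidate document 2>"]

query_emb = model.encode(query_text, convert_to_tensor=True, normalize_embeddings=True)
doc_embs = model.encode(candidate_texts, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(query_emb, doc_embs)[0]              # one score per candidate
ranked_indices = scores.argsort(descending=True).tolist()  # best candidate first
```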

## Running Cross-Encoder Reranking

The cross-encoder reranking script uses pretrained language models to re-rank patent documents based on their semantic similarity to query patents. You can use the script directly or adapt it into a notebook for running on Google Colab.

### Basic Usage:

```bash
python cross_encoder_reranking_train.py --model_name MODEL_NAME --text_type TEXT_TYPE
```

### Parameters:

- `--model_name`: Name of the transformer model to use (default: intfloat/e5-large-v2)
  - Other options: infly/inf-retriever-v1-1.5b, Linq-AI-Research/Linq-Embed-Mistral
- `--text_type`: Type of text to extract from patents (default: TA)
  - TA: Title and abstract only
  - tac1: Title, abstract, and first claim
  - claims: All claims
  - description: Patent description
  - full: All patent content
  - smart: Abstract plus the most central claims, paragraphs, and features selected by clustering
  - features: LLM-extracted features from the claims (not wired in by default; you need to implement it in `extract_text` yourself, see the sketch after this list)
- `--batch_size`: Batch size for processing (default: 4)
- `--max_length`: Maximum sequence length (default: 512)
- `--output`: Output file name (default: predictions2.json)
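
The `features` option is one possible extension rather than part of the released script. A hypothetical helper along the following lines could be called from `extract_text` under a new `"features"` branch (remember to also add `"features"` to the `--text_type` choices in `main()`):

```python
def extract_features_text(content_dict: dict) -> str:
    """Join the LLM-extracted claim features of one patent into a single string.

    Hypothetical helper, not part of the released script: call it from
    extract_text() under a new "features" text_type.
    """
    features = content_dict.get("features", {})  # mapping of feature id -> feature text
    return " ".join(features.values()).strip()
```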

### Examples:

```bash
# Basic usage with default parameters
python cross_encoder_reranking_train.py

# Use the infly/inf-retriever-v1-1.5b model with title and abstract
python cross_encoder_reranking_train.py --model_name infly/inf-retriever-v1-1.5b --text_type TA

# Use the E5 model with title, abstract, and first claim
python cross_encoder_reranking_train.py --model_name intfloat/e5-large-v2 --text_type tac1 --max_length 512

# Use the extracted features only (after adding a "features" branch to extract_text)
python cross_encoder_reranking_train.py --text_type features

# Use a custom batch size and output file
python cross_encoder_reranking_train.py --batch_size 2 --output prediction2.json
```

## Evaluating Results

After running the reranking script, you can measure its performance with the evaluation script:

```bash
python evaluate_train_rankings.py --pre_ranking shuffled_pre_ranking.json --re_ranking PREDICTIONS_FILE
```

### Parameters:

- `--pre_ranking`: Path to the pre-ranking file (default: shuffled_pre_ranking.json)
- `--re_ranking`: Path to the re-ranked file produced by the cross-encoder script (default: predictions2.json)
- `--gold`: Path to the gold standard mappings (default: train_gold_mapping.json)
- `--k_values`: Comma-separated list of k values for Recall@k (default: 3,5,10,20)

### Example:

```bash
python evaluate_train_rankings.py --re_ranking prediction2.json
```

## Output Metrics

The evaluation script reports the following metrics:
- Recall@k: Fraction of each query's relevant documents found in the top-k results, averaged over queries
- MAP (Mean Average Precision): Average precision per query, averaged over all queries
- Mean Inverse Rank: Reciprocal rank of the first relevant document, averaged over queries (higher is better)
- Mean Rank: Position of the first relevant document, averaged over queries (lower is better)
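
The official implementations live in `metrics.py`; the sketch below only illustrates, for a single query, how Recall@k and average precision are usually computed from a ranked list and a gold set (the function names here are illustrative, not the ones exported by `metrics.py`).

```python
def recall_at_k(ranked_docs: list, gold_docs: set, k: int) -> float:
    """Fraction of this query's relevant documents that appear in the top k."""
    return len(set(ranked_docs[:k]) & gold_docs) / len(gold_docs) if gold_docs else 0.0

def average_precision(ranked_docs: list, gold_docs: set) -> float:
    """Average of precision@rank over the ranks where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in gold_docs:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(gold_docs) if gold_docs else 0.0

# Toy example: the two relevant documents are ranked 1st and 4th.
ranked = ["d1", "d7", "d9", "d3", "d5"]
gold = {"d1", "d3"}
print(recall_at_k(ranked, gold, 3))     # 0.5  (one of the two relevant docs in the top 3)
print(average_precision(ranked, gold))  # 0.75 ((1/1 + 2/4) / 2)
```

Averaging these per-query values over all training queries gives the Recall@k and MAP figures reported by `evaluate_train_rankings.py`.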

## Challenge Task

Your task is to improve the reranking of patent documents to maximize the retrieval metrics on the test set. Use the training data to develop and evaluate your approach, then submit your best model for evaluation on the hidden test set.

Good luck!
cross_encoder_reranking_train.py ADDED
@@ -0,0 +1,370 @@
import os
import json
import argparse
import numpy as np
import torch
import torch.nn.functional as F
from tqdm import tqdm
from torch import Tensor
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Load embedder once (used only for the "smart" text type)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_text_list(texts):
    """Embed a list of texts with the shared MiniLM sentence embedder."""
    return embedder.encode(texts, convert_to_tensor=False)

def rank_by_centrality(texts):
    """Order texts by their mean cosine similarity to the other texts (most central first)."""
    embeddings = embed_text_list(texts)
    similarity_matrix = cosine_similarity(embeddings)
    centrality_scores = similarity_matrix.mean(axis=1)
    ranked = sorted(zip(texts, centrality_scores), key=lambda x: x[1], reverse=True)
    return [text for text, _ in ranked]

def cluster_and_rank(texts, threshold=0.75):
    """Cluster near-duplicate texts and keep the most central text of each cluster."""
    if len(texts) <= 3:
        return texts  # Nothing to reduce

    embeddings = embed_text_list(texts)
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1 - threshold,
                                         metric="cosine", linkage='average')
    labels = clustering.fit_predict(embeddings)

    clustered_texts = {}
    for label, text in zip(labels, texts):
        clustered_texts.setdefault(label, []).append(text)

    representative_texts = []
    for cluster_texts in clustered_texts.values():
        ranked = rank_by_centrality(cluster_texts)
        representative_texts.append(ranked[0])  # Choose most central per cluster

    return representative_texts

def process_single_patent(patent_dict):
    """Select representative claims, description paragraphs, and features for one patent."""
    # Claims are stored under "c-en..." keys, description paragraphs under "p..." keys
    claims = [v for k, v in patent_dict.items() if k.startswith("c-en")]
    paragraphs = [v for k, v in patent_dict.items() if k.startswith("p")]
    features = [v for k, v in patent_dict.get("features", {}).items()]

    # Cluster & rank
    top_claims = cluster_and_rank(claims)
    top_paragraphs = cluster_and_rank(paragraphs)
    top_features = cluster_and_rank(features)

    return {
        "claims": rank_by_centrality(top_claims),
        "paragraphs": rank_by_centrality(top_paragraphs),
        "features": rank_by_centrality(top_features),
    }

def load_json_file(file_path):
    """Load JSON data from a file"""
    with open(file_path, 'r') as f:
        return json.load(f)

def save_json_file(data, file_path):
    """Save data to a JSON file"""
    with open(file_path, 'w') as f:
        json.dump(data, f, indent=2)

def load_content_data(file_path):
    """Load content data from a JSON file"""
    with open(file_path, 'r') as f:
        data = json.load(f)

    # Create a dictionary mapping FAN to Content
    content_dict = {item['FAN']: item['Content'] for item in data}
    return content_dict

def extract_text(content_dict, text_type="full"):
    """Extract text from patent content based on text_type"""
    if text_type == "TA" or text_type == "title_abstract":
        # Extract title and abstract
        title = content_dict.get("title", "")
        abstract = content_dict.get("pa01", "")
        return f"{title} {abstract}".strip()

    elif text_type == "claims":
        # Extract all claims (keys starting with 'c-')
        claims = []
        for key, value in content_dict.items():
            if key.startswith('c-'):
                claims.append(value)
        return " ".join(claims)

    elif text_type == "tac1":
        # Extract title, abstract, and first claim
        title = content_dict.get("title", "")
        abstract = content_dict.get("pa01", "")
        # Find the first claim safely
        first_claim = ""
        for key, value in content_dict.items():
            if key.startswith('c-'):
                first_claim = value
                break
        return f"{title} {abstract} {first_claim}".strip()

    elif text_type == "description":
        # Extract all paragraphs (keys starting with 'p')
        paragraphs = []
        for key, value in content_dict.items():
            if key.startswith('p'):
                paragraphs.append(value)
        return " ".join(paragraphs)

    elif text_type == "full":
        # Extract everything
        all_text = []
        # Start with title and abstract for better context at the beginning
        if "title" in content_dict:
            all_text.append(content_dict["title"])
        if "pa01" in content_dict:
            all_text.append(content_dict["pa01"])

        # Add claims and description
        for key, value in content_dict.items():
            if key != "title" and key != "pa01":
                all_text.append(value)

        return " ".join(all_text)

    elif text_type == "smart":
        filtered_dict = process_single_patent(content_dict)
        all_text = []
        # Start with the abstract for better context at the beginning
        if "pa01" in content_dict:
            all_text.append(content_dict["pa01"])

        # For claims, paragraphs and features, we take only the top-10 most central texts
        # Add claims
        for claim in filtered_dict["claims"][:10]:
            all_text.append(claim)
        # Add paragraphs
        for paragraph in filtered_dict["paragraphs"][:10]:
            all_text.append(paragraph)
        # Add features
        for feature in filtered_dict["features"][:10]:
            all_text.append(feature)

        return " ".join(all_text)

    return ""

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Extract the last token representations for pooling"""
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    """Create an instruction-formatted query"""
    return f'Instruct: {task_description}\nQuery: {query}'

def cross_encoder_reranking(query_text, doc_texts, model, tokenizer, batch_size=8, max_length=2048):
    """
    Rerank document texts based on query text using cross-encoder model

    Parameters:
        query_text (str): The query text
        doc_texts (list): List of document texts
        model: The cross-encoder model
        tokenizer: The tokenizer for the model
        batch_size (int): Batch size for processing
        max_length (int): Maximum sequence length

    Returns:
        list: Indices of documents sorted by relevance score (descending)
    """
    device = next(model.parameters()).device
    scores = []

    # Format query with instruction
    task_description = 'Re-rank a set of retrieved patents based on their relevance to a given query patent. The task aims to refine the order of patents by evaluating their semantic similarity to the query patent, ensuring that the most relevant patents appear at the top of the list.'

    instructed_query = get_detailed_instruct(task_description, query_text)

    # Process in batches to avoid OOM
    for i in tqdm(range(0, len(doc_texts), batch_size), desc="Scoring documents", leave=False):
        batch_docs = doc_texts[i:i+batch_size]

        # Encode the instructed query together with this batch of documents
        input_texts = [instructed_query] + batch_docs

        with torch.no_grad():
            # Tokenize
            batch_dict = tokenizer(input_texts, max_length=max_length, padding=True,
                                   truncation=True, return_tensors='pt').to(device)

            # Get embeddings
            outputs = model(**batch_dict)
            embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

            # Normalize embeddings
            embeddings = F.normalize(embeddings, p=2, dim=1)

            # Calculate similarity scores between the query and the documents
            batch_scores = (embeddings[0].unsqueeze(0) @ embeddings[1:].T).squeeze(0) * 100
            scores.extend(batch_scores.cpu().tolist())

    # Create list of (index, score) tuples for sorting
    indexed_scores = list(enumerate(scores))

    # Sort by score in descending order
    indexed_scores.sort(key=lambda x: x[1], reverse=True)

    # Return sorted indices
    return [idx for idx, _ in indexed_scores]

def main():
    parser = argparse.ArgumentParser(description='Re-rank patents using cross-encoder scoring (training queries only)')
    parser.add_argument('--pre_ranking', type=str, default='shuffled_pre_ranking.json',
                        help='Path to pre-ranking JSON file')
    parser.add_argument('--output', type=str, default='predictions2.json',
                        help='Path to output re-ranked JSON file')
    parser.add_argument('--queries_content', type=str,
                        default='./queries_content_with_features.json',
                        help='Path to queries content JSON file')
    parser.add_argument('--documents_content', type=str,
                        default='./documents_content_with_features.json',
                        help='Path to documents content JSON file')
    parser.add_argument('--queries_list', type=str, default='train_queries.json',
                        help='Path to training queries JSON file')
    parser.add_argument('--text_type', type=str, default='TA',
                        choices=['TA', 'claims', 'description', 'full', 'tac1', 'smart'],
                        help='Type of text to use for scoring')
    parser.add_argument('--model_name', type=str, default='intfloat/e5-large-v2',
                        help='Name of the cross-encoder model')
    parser.add_argument('--batch_size', type=int, default=4,
                        help='Batch size for scoring')
    parser.add_argument('--max_length', type=int, default=512,
                        help='Maximum sequence length')
    parser.add_argument('--device', type=str, default='cuda' if torch.cuda.is_available() else 'cpu',
                        help='Device to use (cuda/cpu)')
    parser.add_argument('--base_dir', type=str,
                        default='datasets',
                        help='Base directory for data files')

    args = parser.parse_args()

    # Ensure all paths are relative to base_dir if they're not absolute
    def get_full_path(path):
        if os.path.isabs(path):
            return path
        return os.path.join(args.base_dir, path)

    # Load training queries
    print(f"Loading training queries from {args.queries_list}...")
    queries_list = load_json_file(get_full_path(args.queries_list))
    print(f"Loaded {len(queries_list)} training queries")

    # Load pre-ranking data
    print(f"Loading pre-ranking data from {args.pre_ranking}...")
    pre_ranking = load_json_file(get_full_path(args.pre_ranking))

    # Filter pre-ranking to include only training queries
    pre_ranking = {fan: docs for fan, docs in pre_ranking.items() if fan in queries_list}
    print(f"Filtered pre-ranking to {len(pre_ranking)} training queries")

    # Load content data
    print(f"Loading query content from {args.queries_content}...")
    queries_content = load_content_data(get_full_path(args.queries_content))

    print(f"Loading document content from {args.documents_content}...")
    documents_content = load_content_data(get_full_path(args.documents_content))

    # Load model and tokenizer
    print(f"Loading model {args.model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    model = AutoModel.from_pretrained(args.model_name).to(args.device)
    model.eval()

    # Process each query and re-rank its documents
    print("Starting re-ranking process for training queries...")
    re_ranked = {}
    missing_query_fans = []
    missing_doc_fans = {}

    for query_fan, pre_ranked_docs in tqdm(pre_ranking.items(), desc="Processing queries"):
        # Check if the query FAN exists in our content data
        if query_fan not in queries_content:
            missing_query_fans.append(query_fan)
            continue

        # Extract query text
        query_text = extract_text(queries_content[query_fan], args.text_type)
        if not query_text:
            missing_query_fans.append(query_fan)
            continue

        # Prepare document texts and keep track of their FANs
        doc_texts = []
        doc_fans = []
        missing_docs_for_query = []

        for doc_fan in pre_ranked_docs:
            if doc_fan not in documents_content:
                missing_docs_for_query.append(doc_fan)
                continue

            doc_text = extract_text(documents_content[doc_fan], args.text_type)
            if doc_text:
                doc_texts.append(doc_text)
                doc_fans.append(doc_fan)

        # Keep track of missing documents
        if missing_docs_for_query:
            missing_doc_fans[query_fan] = missing_docs_for_query

        # Skip if no valid documents
        if not doc_texts:
            re_ranked[query_fan] = []
            continue

        # Re-rank documents
        print(f"\nRe-ranking {len(doc_texts)} documents for training query {query_fan}")

        # Print some of the original pre-ranking order for debugging
        print(f"Original pre-ranking (first 3): {doc_fans[:3]}")

        # Use the cross-encoder model for reranking
        sorted_indices = cross_encoder_reranking(
            query_text, doc_texts, model, tokenizer,
            batch_size=args.batch_size, max_length=args.max_length
        )
        re_ranked[query_fan] = [doc_fans[i] for i in sorted_indices]

    # Report any missing FANs
    if missing_query_fans:
        print(f"Warning: {len(missing_query_fans)} query FANs were not found in the content data")
    if missing_doc_fans:
        total_missing = sum(len(docs) for docs in missing_doc_fans.values())
        print(f"Warning: {total_missing} document FANs were not found in the content data")

    # Save re-ranked results
    output_path = get_full_path(args.output)
    print(f"Saving re-ranked results to {output_path}...")
    save_json_file(re_ranked, output_path)

    print("Re-ranking complete!")
    print(f"Number of training queries processed: {len(re_ranked)}")

    # Optionally save the missing FANs information for debugging
    if missing_query_fans or missing_doc_fans:
        missing_info = {
            "missing_query_fans": missing_query_fans,
            "missing_doc_fans": missing_doc_fans
        }
        missing_info_path = f"{os.path.splitext(output_path)[0]}_missing_fans.json"
        save_json_file(missing_info, missing_info_path)
        print(f"Information about missing FANs saved to {missing_info_path}")

if __name__ == "__main__":
    main()
datasets/documents_content_with_features.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fcd453767cc0510e362087256cd4655c7d2c911dca351d9837ce76b89bc94ee1
size 68483618
datasets/predictions2.json ADDED
@@ -0,0 +1,642 @@
1
+ {
2
+ "79314580": [
3
+ "90742932",
4
+ "84307890",
5
+ "72116262",
6
+ "72352339",
7
+ "92464359",
8
+ "43243494",
9
+ "99159171",
10
+ "87379239",
11
+ "84339687",
12
+ "43379960",
13
+ "91207030",
14
+ "87321987",
15
+ "93275449",
16
+ "97088252",
17
+ "100886158",
18
+ "107646251",
19
+ "68932616",
20
+ "77860027",
21
+ "31705225",
22
+ "95503744",
23
+ "74928830",
24
+ "43629694",
25
+ "45747223",
26
+ "68476789",
27
+ "71185778",
28
+ "1360767",
29
+ "74251396",
30
+ "1692313",
31
+ "93620723",
32
+ "84816716"
33
+ ],
34
+ "78061231": [
35
+ "82741321",
36
+ "45789888",
37
+ "44264550",
38
+ "87988246",
39
+ "82567288",
40
+ "17816910",
41
+ "44341743",
42
+ "80564076",
43
+ "14904223",
44
+ "61241196",
45
+ "5179001",
46
+ "75763885",
47
+ "45066507",
48
+ "43895830",
49
+ "69551594",
50
+ "87010037",
51
+ "66645142",
52
+ "105616053",
53
+ "90048840",
54
+ "44165603",
55
+ "44396898",
56
+ "5276096",
57
+ "1881065",
58
+ "84776129",
59
+ "5355298",
60
+ "44430674",
61
+ "103861986",
62
+ "93502001",
63
+ "33466561",
64
+ "72681493"
65
+ ],
66
+ "66336898": [
67
+ "66336824",
68
+ "34249922",
69
+ "44193822",
70
+ "806116",
71
+ "74937",
72
+ "62201555",
73
+ "45975516",
74
+ "67250442",
75
+ "71716600",
76
+ "43894634",
77
+ "68442322",
78
+ "75066704",
79
+ "44399553",
80
+ "7881264",
81
+ "7058578",
82
+ "85834586",
83
+ "71288419",
84
+ "76963351",
85
+ "45623535",
86
+ "65451695",
87
+ "75750109",
88
+ "94003570",
89
+ "44468240",
90
+ "32098459",
91
+ "43507576",
92
+ "95717994",
93
+ "82046827",
94
+ "1744101",
95
+ "93736224",
96
+ "77857414"
97
+ ],
98
+ "105235513": [
99
+ "97452767",
100
+ "86303585",
101
+ "73495202",
102
+ "100476640",
103
+ "43628135",
104
+ "6388780",
105
+ "88396353",
106
+ "45557281",
107
+ "1715585",
108
+ "99616608",
109
+ "86007260",
110
+ "103610533",
111
+ "43536913",
112
+ "68577226",
113
+ "62031984",
114
+ "6423024",
115
+ "75036232",
116
+ "89387100",
117
+ "43885294",
118
+ "75948573",
119
+ "91500742",
120
+ "108870888",
121
+ "62287175",
122
+ "60048046",
123
+ "85370361",
124
+ "73104347",
125
+ "36635271",
126
+ "73572124",
127
+ "79756434",
128
+ "68290564"
129
+ ],
130
+ "77017157": [
131
+ "44375254",
132
+ "99285210",
133
+ "87988242",
134
+ "86238540",
135
+ "43403046",
136
+ "44690402",
137
+ "86190175",
138
+ "60008491",
139
+ "43345792",
140
+ "68534353",
141
+ "5358258",
142
+ "84024189",
143
+ "61946161",
144
+ "44863430",
145
+ "61244827",
146
+ "106520041",
147
+ "92017056",
148
+ "90445221",
149
+ "76112334",
150
+ "44359627",
151
+ "7104341",
152
+ "81094405",
153
+ "88274986",
154
+ "43728980",
155
+ "43605816",
156
+ "22858132",
157
+ "71005806",
158
+ "1737723",
159
+ "76191452",
160
+ "109532698"
161
+ ],
162
+ "106232152": [
163
+ "2719507",
164
+ "67854373",
165
+ "72113421",
166
+ "71547271",
167
+ "73350347",
168
+ "67610829",
169
+ "65038124",
170
+ "103770025",
171
+ "44527852",
172
+ "73995375",
173
+ "75319265",
174
+ "6397305",
175
+ "73903522",
176
+ "86342428",
177
+ "67698815",
178
+ "61938387",
179
+ "1923101",
180
+ "43867066",
181
+ "106667710",
182
+ "83331919",
183
+ "64464686",
184
+ "106895832",
185
+ "45808372",
186
+ "8079418",
187
+ "80564043",
188
+ "61119482",
189
+ "84005413",
190
+ "79986739",
191
+ "85472986",
192
+ "43810028"
193
+ ],
194
+ "76353708": [
195
+ "67682754",
196
+ "109084253",
197
+ "71320166",
198
+ "87799884",
199
+ "79249992",
200
+ "43917872",
201
+ "70492735",
202
+ "43668360",
203
+ "68498995",
204
+ "66038043",
205
+ "95781568",
206
+ "87407728",
207
+ "87854266",
208
+ "44406914",
209
+ "45823324",
210
+ "8011022",
211
+ "44295009",
212
+ "68036958",
213
+ "43591520",
214
+ "7670536",
215
+ "88303745",
216
+ "87707840",
217
+ "32931094",
218
+ "6881047",
219
+ "59339688",
220
+ "75731982",
221
+ "44369885",
222
+ "73095247",
223
+ "22421913",
224
+ "60626819"
225
+ ],
226
+ "79741091": [
227
+ "32607171",
228
+ "22769689",
229
+ "32370299",
230
+ "32660850",
231
+ "46019838",
232
+ "72004503",
233
+ "67967480",
234
+ "84813687",
235
+ "33434123",
236
+ "90649732",
237
+ "80783830",
238
+ "7842236",
239
+ "71187327",
240
+ "81276441",
241
+ "66130772",
242
+ "82420795",
243
+ "90056913",
244
+ "77861056",
245
+ "28976620",
246
+ "33300848",
247
+ "92514878",
248
+ "45939997",
249
+ "33374965",
250
+ "86456573",
251
+ "7610722",
252
+ "33531861",
253
+ "13993026",
254
+ "22225138",
255
+ "43614108",
256
+ "52035153"
257
+ ],
258
+ "86860100": [
259
+ "86860100",
260
+ "81692608",
261
+ "73365388",
262
+ "73843030",
263
+ "96849928",
264
+ "61287850",
265
+ "76760721",
266
+ "80708349",
267
+ "81389608",
268
+ "79098866",
269
+ "80775708",
270
+ "45496451",
271
+ "80710567",
272
+ "96208625",
273
+ "76958369",
274
+ "86258263",
275
+ "76279856",
276
+ "73313390",
277
+ "76127196",
278
+ "96379124",
279
+ "66478420",
280
+ "59567985",
281
+ "89793282",
282
+ "92821182",
283
+ "98396291",
284
+ "74805580",
285
+ "45448810",
286
+ "61233401",
287
+ "96681288",
288
+ "88179467"
289
+ ],
290
+ "79045164": [
291
+ "83843480",
292
+ "43156809",
293
+ "93073478",
294
+ "72741864",
295
+ "32999955",
296
+ "80772296",
297
+ "84338721",
298
+ "22415814",
299
+ "69217265",
300
+ "88080895",
301
+ "14870284",
302
+ "32458611",
303
+ "5083848",
304
+ "84605013",
305
+ "29184345",
306
+ "80753707",
307
+ "32511812",
308
+ "31628753",
309
+ "68627132",
310
+ "71610617",
311
+ "61304760",
312
+ "32459382",
313
+ "95191554",
314
+ "21116140",
315
+ "31663374",
316
+ "72450581",
317
+ "21912599",
318
+ "65179742",
319
+ "33051108",
320
+ "33352117"
321
+ ],
322
+ "85176564": [
323
+ "72443837",
324
+ "79000606",
325
+ "77527141",
326
+ "45199654",
327
+ "81692374",
328
+ "73833726",
329
+ "100297053",
330
+ "88192728",
331
+ "87866933",
332
+ "80686648",
333
+ "32416570",
334
+ "79971987",
335
+ "67147862",
336
+ "70911248",
337
+ "80770876",
338
+ "69065347",
339
+ "96680699",
340
+ "7279536",
341
+ "74791199",
342
+ "89963551",
343
+ "77926055",
344
+ "33202790",
345
+ "72169870",
346
+ "45096562",
347
+ "21344722",
348
+ "86043344",
349
+ "90367775",
350
+ "71782269",
351
+ "21927550",
352
+ "7623115"
353
+ ],
354
+ "74598812": [
355
+ "74598812",
356
+ "79540926",
357
+ "67307419",
358
+ "96695878",
359
+ "71716027",
360
+ "44524425",
361
+ "87116173",
362
+ "78616844",
363
+ "72840131",
364
+ "91960790",
365
+ "97024444",
366
+ "44961149",
367
+ "94552652",
368
+ "44520632",
369
+ "95992945",
370
+ "94439793",
371
+ "79335919",
372
+ "87894878",
373
+ "86118184",
374
+ "13982549",
375
+ "85638842",
376
+ "85933200",
377
+ "97867456",
378
+ "99247566",
379
+ "88364610",
380
+ "89362509",
381
+ "77491493",
382
+ "93554574",
383
+ "89131365",
384
+ "4974218"
385
+ ],
386
+ "78286001": [
387
+ "65432068",
388
+ "81715794",
389
+ "72703550",
390
+ "87559850",
391
+ "74552790",
392
+ "90037048",
393
+ "71872009",
394
+ "92628690",
395
+ "86035156",
396
+ "64948806",
397
+ "68821941",
398
+ "15060309",
399
+ "82906678",
400
+ "86335187",
401
+ "68865940",
402
+ "21833368",
403
+ "81622078",
404
+ "7950858",
405
+ "8046245",
406
+ "72460497",
407
+ "83140017",
408
+ "85700905",
409
+ "61290118",
410
+ "66400878",
411
+ "59354384",
412
+ "22675326",
413
+ "85969473",
414
+ "90271766",
415
+ "69569655",
416
+ "86430033"
417
+ ],
418
+ "79098180": [
419
+ "86237859",
420
+ "86696791",
421
+ "86301854",
422
+ "84416764",
423
+ "69328996",
424
+ "81822743",
425
+ "78288530",
426
+ "1376881",
427
+ "70371603",
428
+ "75998807",
429
+ "18158001",
430
+ "86536566",
431
+ "7345574",
432
+ "42810307",
433
+ "44417984",
434
+ "4169775",
435
+ "87437516",
436
+ "60044376",
437
+ "43094034",
438
+ "42365601",
439
+ "66025864",
440
+ "1354473",
441
+ "43878837",
442
+ "42886784",
443
+ "103929967",
444
+ "45901317",
445
+ "68722856",
446
+ "84117280",
447
+ "82465446",
448
+ "4082413"
449
+ ],
450
+ "78090091": [
451
+ "45307524",
452
+ "43533791",
453
+ "44002297",
454
+ "44230813",
455
+ "43830101",
456
+ "44448545",
457
+ "59568578",
458
+ "67700598",
459
+ "69615854",
460
+ "73425457",
461
+ "72990387",
462
+ "43642277",
463
+ "43055440",
464
+ "74340036",
465
+ "67448176",
466
+ "98558258",
467
+ "44302706",
468
+ "94631184",
469
+ "43331251",
470
+ "80640194",
471
+ "15004529",
472
+ "102506777",
473
+ "7602987",
474
+ "106235322",
475
+ "92389110",
476
+ "43912169",
477
+ "61270623",
478
+ "93720691",
479
+ "43473559",
480
+ "1250801"
481
+ ],
482
+ "80155730": [
483
+ "80155730",
484
+ "4166228",
485
+ "44526736",
486
+ "7552728",
487
+ "74507997",
488
+ "66361815",
489
+ "71124588",
490
+ "75237231",
491
+ "44159656",
492
+ "65179536",
493
+ "66361826",
494
+ "67260382",
495
+ "45097472",
496
+ "96208467",
497
+ "81705685",
498
+ "75606608",
499
+ "73532504",
500
+ "67610635",
501
+ "75057499",
502
+ "61899721",
503
+ "79832819",
504
+ "8064990",
505
+ "83115863",
506
+ "87397919",
507
+ "66396259",
508
+ "95909144",
509
+ "92933079",
510
+ "32691430",
511
+ "84161041",
512
+ "78012216"
513
+ ],
514
+ "76109734": [
515
+ "85385104",
516
+ "43765404",
517
+ "77836411",
518
+ "92130365",
519
+ "74594424",
520
+ "106884384",
521
+ "45448491",
522
+ "43710765",
523
+ "95594159",
524
+ "44495876",
525
+ "98208043",
526
+ "87667801",
527
+ "7303675",
528
+ "74580642",
529
+ "87811160",
530
+ "92416430",
531
+ "43804973",
532
+ "91962087",
533
+ "71546239",
534
+ "91897986",
535
+ "67177917",
536
+ "67261744",
537
+ "86218698",
538
+ "88597870",
539
+ "32299928",
540
+ "7018871",
541
+ "84687916",
542
+ "85523256",
543
+ "86020088",
544
+ "89980510"
545
+ ],
546
+ "106318129": [
547
+ "81915722",
548
+ "71772339",
549
+ "87092702",
550
+ "83106147",
551
+ "73657004",
552
+ "43618418",
553
+ "33159460",
554
+ "44586128",
555
+ "75798279",
556
+ "72024688",
557
+ "22135206",
558
+ "43776935",
559
+ "22859958",
560
+ "22666908",
561
+ "83266142",
562
+ "99268783",
563
+ "81302611",
564
+ "98200590",
565
+ "86236123",
566
+ "1772511",
567
+ "93406272",
568
+ "1882557",
569
+ "88406090",
570
+ "84038640",
571
+ "33448504",
572
+ "84797878",
573
+ "6846568",
574
+ "7077099",
575
+ "6706486",
576
+ "5134678"
577
+ ],
578
+ "1864211": [
579
+ "1864211",
580
+ "72442770",
581
+ "8059518",
582
+ "82586094",
583
+ "89411840",
584
+ "14904274",
585
+ "68146754",
586
+ "43490211",
587
+ "1694333",
588
+ "43694603",
589
+ "92166028",
590
+ "72958082",
591
+ "4119507",
592
+ "44553956",
593
+ "96329497",
594
+ "44286898",
595
+ "73417734",
596
+ "87324486",
597
+ "43711365",
598
+ "62074363",
599
+ "71119422",
600
+ "18233366",
601
+ "42841161",
602
+ "43822575",
603
+ "43639144",
604
+ "74893664",
605
+ "43955312",
606
+ "3986038",
607
+ "1613231",
608
+ "21083950"
609
+ ],
610
+ "61277994": [
611
+ "71858872",
612
+ "957307",
613
+ "61989058",
614
+ "44408142",
615
+ "43824316",
616
+ "44346272",
617
+ "61243783",
618
+ "95090315",
619
+ "94883937",
620
+ "62081642",
621
+ "76435367",
622
+ "110878972",
623
+ "42463440",
624
+ "75763896",
625
+ "87721730",
626
+ "6480562",
627
+ "66483344",
628
+ "43318215",
629
+ "7545380",
630
+ "86932488",
631
+ "82532105",
632
+ "44432202",
633
+ "77734762",
634
+ "84184048",
635
+ "82777643",
636
+ "75041109",
637
+ "44355159",
638
+ "76149716",
639
+ "13996136",
640
+ "22195474"
641
+ ]
642
+ }
datasets/queries_content_with_features.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/shuffled_pre_ranking.json ADDED
@@ -0,0 +1,962 @@
1
+ {
2
+ "103964109": [
3
+ "94596291",
4
+ "65451984",
5
+ "81098918",
6
+ "86686331",
7
+ "82807300",
8
+ "74999904",
9
+ "44437432",
10
+ "70494531",
11
+ "89655285",
12
+ "94546339",
13
+ "84923580",
14
+ "110338873",
15
+ "1662314",
16
+ "87285519",
17
+ "112489610",
18
+ "93007218",
19
+ "74364787",
20
+ "93196199",
21
+ "91358966",
22
+ "73189654",
23
+ "91801222",
24
+ "85915967",
25
+ "102035322",
26
+ "96138054",
27
+ "87488738",
28
+ "101974338",
29
+ "104761777",
30
+ "101598636",
31
+ "105078785",
32
+ "92631163"
33
+ ],
34
+ "79314580": [
35
+ "99159171",
36
+ "95503744",
37
+ "77860027",
38
+ "93275449",
39
+ "43379960",
40
+ "100886158",
41
+ "97088252",
42
+ "71185778",
43
+ "72352339",
44
+ "74251396",
45
+ "1692313",
46
+ "87379239",
47
+ "1360767",
48
+ "31705225",
49
+ "92464359",
50
+ "72116262",
51
+ "90742932",
52
+ "84816716",
53
+ "93620723",
54
+ "107646251",
55
+ "84339687",
56
+ "43243494",
57
+ "87321987",
58
+ "84307890",
59
+ "68932616",
60
+ "74928830",
61
+ "91207030",
62
+ "68476789",
63
+ "45747223",
64
+ "43629694"
65
+ ],
66
+ "78061231": [
67
+ "72681493",
68
+ "105616053",
69
+ "66645142",
70
+ "44264550",
71
+ "61241196",
72
+ "69551594",
73
+ "44165603",
74
+ "44341743",
75
+ "87010037",
76
+ "17816910",
77
+ "33466561",
78
+ "75763885",
79
+ "14904223",
80
+ "5355298",
81
+ "82741321",
82
+ "5179001",
83
+ "1881065",
84
+ "93502001",
85
+ "5276096",
86
+ "80564076",
87
+ "43895830",
88
+ "103861986",
89
+ "82567288",
90
+ "90048840",
91
+ "84776129",
92
+ "44430674",
93
+ "45789888",
94
+ "45066507",
95
+ "44396898",
96
+ "87988246"
97
+ ],
98
+ "72214279": [
99
+ "4112132",
100
+ "7126783",
101
+ "66704221",
102
+ "77750612",
103
+ "97541974",
104
+ "1024924",
105
+ "73313951",
106
+ "70443794",
107
+ "66859488",
108
+ "7024647",
109
+ "7186024",
110
+ "7323805",
111
+ "74719750",
112
+ "78285896",
113
+ "76635780",
114
+ "78874284",
115
+ "80491195",
116
+ "66704376",
117
+ "7409445",
118
+ "61941302",
119
+ "95087186",
120
+ "62204608",
121
+ "7185266",
122
+ "95793433",
123
+ "73575071",
124
+ "5188978",
125
+ "86201489",
126
+ "43855508",
127
+ "6786492",
128
+ "1052098"
129
+ ],
130
+ "68249923": [
131
+ "35246469",
132
+ "80619246",
133
+ "7191590",
134
+ "7503954",
135
+ "66204675",
136
+ "96103327",
137
+ "84151582",
138
+ "94133702",
139
+ "22791064",
140
+ "17811615",
141
+ "22538180",
142
+ "44330428",
143
+ "84067048",
144
+ "66785381",
145
+ "89264734",
146
+ "1880329",
147
+ "80790764",
148
+ "35300504",
149
+ "68984248",
150
+ "92458890",
151
+ "89313603",
152
+ "86319625",
153
+ "13900435",
154
+ "4811621",
155
+ "95326577",
156
+ "45407719",
157
+ "4102206",
158
+ "86874453",
159
+ "75001201",
160
+ "43783883"
161
+ ],
162
+ "66336898": [
163
+ "7881264",
164
+ "71288419",
165
+ "44193822",
166
+ "67250442",
167
+ "76963351",
168
+ "82046827",
169
+ "77857414",
170
+ "74937",
171
+ "44468240",
172
+ "45975516",
173
+ "71716600",
174
+ "45623535",
175
+ "85834586",
176
+ "94003570",
177
+ "1744101",
178
+ "806116",
179
+ "7058578",
180
+ "43894634",
181
+ "68442322",
182
+ "62201555",
183
+ "95717994",
184
+ "43507576",
185
+ "75066704",
186
+ "44399553",
187
+ "66336824",
188
+ "65451695",
189
+ "93736224",
190
+ "75750109",
191
+ "32098459",
192
+ "34249922"
193
+ ],
194
+ "105235513": [
195
+ "99616608",
196
+ "6423024",
197
+ "91500742",
198
+ "60048046",
199
+ "1715585",
200
+ "108870888",
201
+ "68577226",
202
+ "43628135",
203
+ "45557281",
204
+ "88396353",
205
+ "97452767",
206
+ "89387100",
207
+ "73495202",
208
+ "73572124",
209
+ "75948573",
210
+ "62031984",
211
+ "43536913",
212
+ "79756434",
213
+ "73104347",
214
+ "43885294",
215
+ "86007260",
216
+ "103610533",
217
+ "86303585",
218
+ "62287175",
219
+ "68290564",
220
+ "75036232",
221
+ "85370361",
222
+ "36635271",
223
+ "6388780",
224
+ "100476640"
225
+ ],
226
+ "79740635": [
227
+ "72282319",
228
+ "80126462",
229
+ "6507475",
230
+ "69278139",
231
+ "75024859",
232
+ "7937000",
233
+ "20176022",
234
+ "77144959",
235
+ "94502562",
236
+ "82301362",
237
+ "64815601",
238
+ "22271872",
239
+ "77324818",
240
+ "69050704",
241
+ "7582109",
242
+ "83374518",
243
+ "64764746",
244
+ "89202146",
245
+ "80781908",
246
+ "32251980",
247
+ "77783945",
248
+ "7332283",
249
+ "97072553",
250
+ "43110403",
251
+ "79096739",
252
+ "13938283",
253
+ "62141731",
254
+ "80463331",
255
+ "89202129",
256
+ "98938772"
257
+ ],
258
+ "100251983": [
259
+ "82028496",
260
+ "89295990",
261
+ "78872192",
262
+ "33473034",
263
+ "100515176",
264
+ "93007808",
265
+ "100251983",
266
+ "90670288",
267
+ "96091957",
268
+ "94669658",
269
+ "80167129",
270
+ "45282121",
271
+ "104436949",
272
+ "97292661",
273
+ "107325691",
274
+ "89672703",
275
+ "82206845",
276
+ "93331965",
277
+ "106996649",
278
+ "90490909",
279
+ "86073872",
280
+ "94029448",
281
+ "88392348",
282
+ "98600374",
283
+ "96059580",
284
+ "72140435",
285
+ "91560811",
286
+ "46598360",
287
+ "78547811",
288
+ "88463494"
289
+ ],
290
+ "77017157": [
291
+ "76191452",
292
+ "44359627",
293
+ "43728980",
294
+ "86238540",
295
+ "84024189",
296
+ "88274986",
297
+ "60008491",
298
+ "61244827",
299
+ "44375254",
300
+ "76112334",
301
+ "43403046",
302
+ "5358258",
303
+ "7104341",
304
+ "43345792",
305
+ "61946161",
306
+ "22858132",
307
+ "90445221",
308
+ "44690402",
309
+ "81094405",
310
+ "71005806",
311
+ "87988242",
312
+ "86190175",
313
+ "1737723",
314
+ "99285210",
315
+ "109532698",
316
+ "92017056",
317
+ "44863430",
318
+ "106520041",
319
+ "43605816",
320
+ "68534353"
321
+ ],
322
+ "106232152": [
323
+ "64464686",
324
+ "43867066",
325
+ "45808372",
326
+ "103770025",
327
+ "72113421",
328
+ "84005413",
329
+ "43810028",
330
+ "73350347",
331
+ "67854373",
332
+ "73995375",
333
+ "75319265",
334
+ "79986739",
335
+ "71547271",
336
+ "44527852",
337
+ "8079418",
338
+ "67610829",
339
+ "61119482",
340
+ "2719507",
341
+ "73903522",
342
+ "106667710",
343
+ "67698815",
344
+ "106895832",
345
+ "80564043",
346
+ "65038124",
347
+ "61938387",
348
+ "6397305",
349
+ "83331919",
350
+ "1923101",
351
+ "85472986",
352
+ "86342428"
353
+ ],
354
+ "76353708": [
355
+ "87707840",
356
+ "87854266",
357
+ "43591520",
358
+ "79249992",
359
+ "95781568",
360
+ "32931094",
361
+ "87407728",
362
+ "70492735",
363
+ "6881047",
364
+ "66038043",
365
+ "73095247",
366
+ "7670536",
367
+ "43917872",
368
+ "44295009",
369
+ "44369885",
370
+ "45823324",
371
+ "71320166",
372
+ "44406914",
373
+ "22421913",
374
+ "88303745",
375
+ "59339688",
376
+ "68036958",
377
+ "60626819",
378
+ "43668360",
379
+ "87799884",
380
+ "67682754",
381
+ "68498995",
382
+ "109084253",
383
+ "75731982",
384
+ "8011022"
385
+ ],
386
+ "79741091": [
387
+ "46019838",
388
+ "71187327",
389
+ "52035153",
390
+ "7842236",
391
+ "22769689",
392
+ "84813687",
393
+ "13993026",
394
+ "82420795",
395
+ "66130772",
396
+ "80783830",
397
+ "72004503",
398
+ "7610722",
399
+ "33374965",
400
+ "43614108",
401
+ "67967480",
402
+ "32660850",
403
+ "32607171",
404
+ "81276441",
405
+ "90056913",
406
+ "28976620",
407
+ "32370299",
408
+ "92514878",
409
+ "86456573",
410
+ "45939997",
411
+ "90649732",
412
+ "22225138",
413
+ "77861056",
414
+ "33300848",
415
+ "33531861",
416
+ "33434123"
417
+ ],
418
+ "86860100": [
419
+ "98396291",
420
+ "89793282",
421
+ "80708349",
422
+ "96681288",
423
+ "76279856",
424
+ "76127196",
425
+ "86258263",
426
+ "79098866",
427
+ "76958369",
428
+ "80775708",
429
+ "61233401",
430
+ "80710567",
431
+ "73365388",
432
+ "66478420",
433
+ "74805580",
434
+ "76760721",
435
+ "96208625",
436
+ "73313390",
437
+ "81692608",
438
+ "92821182",
439
+ "86860100",
440
+ "96849928",
441
+ "88179467",
442
+ "59567985",
443
+ "45448810",
444
+ "61287850",
445
+ "73843030",
446
+ "81389608",
447
+ "96379124",
448
+ "45496451"
449
+ ],
450
+ "79045164": [
451
+ "21116140",
452
+ "29184345",
453
+ "33352117",
454
+ "80753707",
455
+ "65179742",
456
+ "68627132",
457
+ "32999955",
458
+ "22415814",
459
+ "71610617",
460
+ "83843480",
461
+ "33051108",
462
+ "31628753",
463
+ "32511812",
464
+ "93073478",
465
+ "32459382",
466
+ "5083848",
467
+ "43156809",
468
+ "88080895",
469
+ "14870284",
470
+ "21912599",
471
+ "69217265",
472
+ "32458611",
473
+ "61304760",
474
+ "80772296",
475
+ "72450581",
476
+ "31663374",
477
+ "95191554",
478
+ "84605013",
479
+ "84338721",
480
+ "72741864"
481
+ ],
482
+ "85176564": [
483
+ "45096562",
484
+ "32416570",
485
+ "67147862",
486
+ "21344722",
487
+ "73833726",
488
+ "79000606",
489
+ "70911248",
490
+ "100297053",
491
+ "71782269",
492
+ "72443837",
493
+ "87866933",
494
+ "69065347",
495
+ "21927550",
496
+ "86043344",
497
+ "80686648",
498
+ "89963551",
499
+ "7279536",
500
+ "90367775",
501
+ "80770876",
502
+ "81692374",
503
+ "72169870",
504
+ "33202790",
505
+ "74791199",
506
+ "96680699",
507
+ "45199654",
508
+ "77527141",
509
+ "88192728",
510
+ "77926055",
511
+ "7623115",
512
+ "79971987"
513
+ ],
514
+ "74598812": [
515
+ "78616844",
516
+ "44961149",
517
+ "91960790",
518
+ "79540926",
519
+ "72840131",
520
+ "99247566",
521
+ "94552652",
522
+ "89362509",
523
+ "79335919",
524
+ "71716027",
525
+ "95992945",
526
+ "86118184",
527
+ "74598812",
528
+ "87116173",
529
+ "93554574",
530
+ "13982549",
531
+ "88364610",
532
+ "97024444",
533
+ "44520632",
534
+ "4974218",
535
+ "87894878",
536
+ "85933200",
537
+ "97867456",
538
+ "85638842",
539
+ "89131365",
540
+ "67307419",
541
+ "44524425",
542
+ "77491493",
543
+ "94439793",
544
+ "96695878"
545
+ ],
546
+ "76109416": [
547
+ "67929958",
548
+ "1212129",
549
+ "74871145",
550
+ "92580522",
551
+ "44149931",
552
+ "96952506",
553
+ "75226918",
554
+ "76109416",
555
+ "87097786",
556
+ "90897974",
557
+ "89081482",
558
+ "43237025",
559
+ "43845712",
560
+ "44268155",
561
+ "72024697",
562
+ "68498126",
563
+ "85956111",
564
+ "86120389",
565
+ "86772879",
566
+ "71010602",
567
+ "6825461",
568
+ "66358",
569
+ "86586319",
570
+ "45066980",
571
+ "79148717",
572
+ "1859328",
573
+ "83775044",
574
+ "98417973",
575
+ "83915124",
576
+ "84518942"
577
+ ],
578
+ "78286001": [
579
+ "86430033",
580
+ "69569655",
581
+ "87559850",
582
+ "90037048",
583
+ "7950858",
584
+ "59354384",
585
+ "8046245",
586
+ "68821941",
587
+ "65432068",
588
+ "81715794",
589
+ "66400878",
590
+ "86035156",
591
+ "86335187",
592
+ "64948806",
593
+ "15060309",
594
+ "68865940",
595
+ "90271766",
596
+ "72460497",
597
+ "21833368",
598
+ "82906678",
599
+ "74552790",
600
+ "92628690",
601
+ "72703550",
602
+ "85969473",
603
+ "81622078",
604
+ "71872009",
605
+ "83140017",
606
+ "85700905",
607
+ "22675326",
608
+ "61290118"
609
+ ],
610
+ "79098180": [
611
+ "84117280",
612
+ "86237859",
613
+ "44417984",
614
+ "43094034",
615
+ "45901317",
616
+ "42886784",
617
+ "7345574",
618
+ "43878837",
619
+ "1354473",
620
+ "86301854",
621
+ "66025864",
622
+ "86696791",
623
+ "42365601",
624
+ "86536566",
625
+ "18158001",
626
+ "70371603",
627
+ "1376881",
628
+ "81822743",
629
+ "4082413",
630
+ "4169775",
631
+ "69328996",
632
+ "68722856",
633
+ "60044376",
634
+ "84416764",
635
+ "75998807",
636
+ "42810307",
637
+ "78288530",
638
+ "103929967",
639
+ "82465446",
640
+ "87437516"
641
+ ],
642
+ "85685768": [
643
+ "5399125",
644
+ "103242931",
645
+ "96992323",
646
+ "79386554",
647
+ "91467629",
648
+ "101540987",
649
+ "94616956",
650
+ "101485622",
651
+ "43287174",
652
+ "90415246",
653
+ "91796062",
654
+ "43124738",
655
+ "92906077",
656
+ "86711677",
657
+ "91034639",
658
+ "68036945",
659
+ "100252809",
660
+ "6716158",
661
+ "22854327",
662
+ "91230085",
663
+ "44105210",
664
+ "80775193",
665
+ "92933013",
666
+ "7874471",
667
+ "93463606",
668
+ "98140923",
669
+ "87659292",
670
+ "71483822",
671
+ "92322829",
672
+ "22926275"
673
+ ],
674
+ "78090091": [
675
+ "72990387",
676
+ "106235322",
677
+ "44002297",
678
+ "43533791",
679
+ "1250801",
680
+ "93720691",
681
+ "43473559",
682
+ "44448545",
683
+ "94631184",
684
+ "74340036",
685
+ "43912169",
686
+ "15004529",
687
+ "44230813",
688
+ "43830101",
689
+ "43642277",
690
+ "92389110",
691
+ "44302706",
692
+ "59568578",
693
+ "67448176",
694
+ "61270623",
695
+ "45307524",
696
+ "7602987",
697
+ "69615854",
698
+ "43331251",
699
+ "102506777",
700
+ "98558258",
701
+ "43055440",
702
+ "73425457",
703
+ "67700598",
704
+ "80640194"
705
+ ],
706
+ "80155730": [
707
+ "67610635",
708
+ "92933079",
709
+ "75057499",
710
+ "45097472",
711
+ "87397919",
712
+ "7552728",
713
+ "8064990",
714
+ "83115863",
715
+ "74507997",
716
+ "96208467",
717
+ "79832819",
718
+ "66361826",
719
+ "71124588",
720
+ "78012216",
721
+ "75237231",
722
+ "65179536",
723
+ "75606608",
724
+ "80155730",
725
+ "81705685",
726
+ "44159656",
727
+ "66361815",
728
+ "67260382",
729
+ "61899721",
730
+ "32691430",
731
+ "84161041",
732
+ "95909144",
733
+ "73532504",
734
+ "66396259",
735
+ "44526736",
736
+ "4166228"
737
+ ],
738
+ "70563808": [
739
+ "76578541",
740
+ "66791385",
741
+ "44519425",
742
+ "13902208",
743
+ "75429864",
744
+ "59356913",
745
+ "45525493",
746
+ "14687460",
747
+ "7305297",
748
+ "87318046",
749
+ "72133315",
750
+ "62199485",
751
+ "7445616",
752
+ "105795426",
753
+ "7728171",
754
+ "61287856",
755
+ "6383581",
756
+ "6936022",
757
+ "43891147",
758
+ "7706134",
759
+ "70563808",
760
+ "1651911",
761
+ "77540953",
762
+ "7807387",
763
+ "46288752",
764
+ "78164659",
765
+ "7949497",
766
+ "45039835",
767
+ "45760699",
768
+ "8125158"
769
+ ],
770
+ "79482665": [
771
+ "87571623",
772
+ "92503480",
773
+ "68914227",
774
+ "81582469",
775
+ "102635100",
776
+ "88449039",
777
+ "67370804",
778
+ "73293133",
779
+ "95908606",
780
+ "93592423",
781
+ "6903999",
782
+ "71022105",
783
+ "87300637",
784
+ "72279718",
785
+ "73497949",
786
+ "88456909",
787
+ "87107633",
788
+ "93318781",
789
+ "86402797",
790
+ "93524596",
791
+ "46019842",
792
+ "44666167",
793
+ "110233006",
794
+ "103621196",
795
+ "84540103",
796
+ "93279107",
797
+ "70930284",
798
+ "89030165",
799
+ "4095476",
800
+ "83083320"
801
+ ],
802
+ "76109734": [
803
+ "71546239",
804
+ "67177917",
805
+ "106884384",
806
+ "95594159",
807
+ "43804973",
808
+ "85385104",
809
+ "98208043",
810
+ "92416430",
811
+ "43765404",
812
+ "7303675",
813
+ "74580642",
814
+ "74594424",
815
+ "88597870",
816
+ "92130365",
817
+ "45448491",
818
+ "32299928",
819
+ "85523256",
820
+ "87667801",
821
+ "91897986",
822
+ "86218698",
823
+ "87811160",
824
+ "43710765",
825
+ "86020088",
826
+ "67261744",
827
+ "91962087",
828
+ "89980510",
829
+ "44495876",
830
+ "77836411",
831
+ "7018871",
832
+ "84687916"
833
+ ],
834
+ "106318129": [
835
+ "81915722",
836
+ "22135206",
837
+ "88406090",
838
+ "86236123",
839
+ "33159460",
840
+ "6706486",
841
+ "7077099",
842
+ "73657004",
843
+ "71772339",
844
+ "6846568",
845
+ "83266142",
846
+ "43618418",
847
+ "83106147",
848
+ "33448504",
849
+ "99268783",
850
+ "1772511",
851
+ "87092702",
852
+ "22666908",
853
+ "72024688",
854
+ "1882557",
855
+ "98200590",
856
+ "43776935",
857
+ "5134678",
858
+ "75798279",
859
+ "22859958",
860
+ "93406272",
861
+ "81302611",
862
+ "84038640",
863
+ "84797878",
864
+ "44586128"
865
+ ],
866
+ "1864211": [
867
+ "74893664",
868
+ "14904274",
869
+ "73417734",
870
+ "43490211",
871
+ "82586094",
872
+ "18233366",
873
+ "1694333",
874
+ "1613231",
875
+ "8059518",
876
+ "62074363",
877
+ "4119507",
878
+ "44286898",
879
+ "43822575",
880
+ "71119422",
881
+ "43694603",
882
+ "87324486",
883
+ "3986038",
884
+ "1864211",
885
+ "68146754",
886
+ "43711365",
887
+ "72958082",
888
+ "44553956",
889
+ "89411840",
890
+ "21083950",
891
+ "92166028",
892
+ "96329497",
893
+ "43955312",
894
+ "43639144",
895
+ "42841161",
896
+ "72442770"
897
+ ],
898
+ "75800075": [
899
+ "33464411",
900
+ "34284570",
901
+ "74966633",
902
+ "71238892",
903
+ "84214328",
904
+ "73305870",
905
+ "77197418",
906
+ "64972313",
907
+ "93085483",
908
+ "61870386",
909
+ "73750287",
910
+ "35215942",
911
+ "75692075",
912
+ "77144269",
913
+ "35197610",
914
+ "43878177",
915
+ "76120190",
916
+ "81692381",
917
+ "43687538",
918
+ "62288211",
919
+ "70999237",
920
+ "76825949",
921
+ "7588356",
922
+ "34173412",
923
+ "43796291",
924
+ "62194904",
925
+ "81704710",
926
+ "22823110",
927
+ "7228433",
928
+ "86183849"
929
+ ],
930
+ "61277994": [
931
+ "13996136",
932
+ "87721730",
933
+ "43824316",
934
+ "82532105",
935
+ "44408142",
936
+ "61989058",
937
+ "62081642",
938
+ "82777643",
939
+ "86932488",
940
+ "44355159",
941
+ "957307",
942
+ "43318215",
943
+ "61243783",
944
+ "110878972",
945
+ "66483344",
946
+ "76149716",
947
+ "22195474",
948
+ "95090315",
949
+ "76435367",
950
+ "44432202",
951
+ "6480562",
952
+ "84184048",
953
+ "7545380",
954
+ "44346272",
955
+ "75763896",
956
+ "94883937",
957
+ "75041109",
958
+ "42463440",
959
+ "71858872",
960
+ "77734762"
961
+ ]
962
+ }
datasets/test_queries.json ADDED
@@ -0,0 +1,12 @@
[
  "76109416",
  "75800075",
  "68249923",
  "79482665",
  "79740635",
  "100251983",
  "70563808",
  "103964109",
  "72214279",
  "85685768"
]
datasets/train_gold_mapping.json ADDED
@@ -0,0 +1,111 @@
1
+ {
2
+ "79314580": [
3
+ "43379960",
4
+ "68932616",
5
+ "43243494"
6
+ ],
7
+ "78061231": [
8
+ "44396898",
9
+ "45789888",
10
+ "44165603",
11
+ "44264550"
12
+ ],
13
+ "66336898": [
14
+ "32098459",
15
+ "44468240",
16
+ "74937",
17
+ "62201555",
18
+ "44193822",
19
+ "806116"
20
+ ],
21
+ "105235513": [
22
+ "85370361"
23
+ ],
24
+ "77017157": [
25
+ "61244827",
26
+ "71005806",
27
+ "61946161"
28
+ ],
29
+ "106232152": [
30
+ "84005413",
31
+ "73995375",
32
+ "45808372",
33
+ "86342428"
34
+ ],
35
+ "76353708": [
36
+ "71320166",
37
+ "59339688",
38
+ "44295009",
39
+ "22421913",
40
+ "32931094",
41
+ "60626819",
42
+ "44369885"
43
+ ],
44
+ "79741091": [
45
+ "33434123"
46
+ ],
47
+ "86860100": [
48
+ "66478420",
49
+ "76958369"
50
+ ],
51
+ "79045164": [
52
+ "31628753",
53
+ "72450581",
54
+ "5083848",
55
+ "33352117",
56
+ "72741864"
57
+ ],
58
+ "85176564": [
59
+ "69065347",
60
+ "70911248"
61
+ ],
62
+ "74598812": [
63
+ "72840131"
64
+ ],
65
+ "78286001": [
66
+ "72460497",
67
+ "68865940",
68
+ "68821941"
69
+ ],
70
+ "79098180": [
71
+ "1376881",
72
+ "68722856",
73
+ "7345574"
74
+ ],
75
+ "78090091": [
76
+ "43331251",
77
+ "72990387",
78
+ "44002297",
79
+ "43473559",
80
+ "44230813"
81
+ ],
82
+ "80155730": [
83
+ "67610635",
84
+ "66361815"
85
+ ],
86
+ "76109734": [
87
+ "67177917",
88
+ "67261744",
89
+ "43710765",
90
+ "32299928",
91
+ "43804973"
92
+ ],
93
+ "106318129": [
94
+ "88406090",
95
+ "22859958",
96
+ "44586128"
97
+ ],
98
+ "1864211": [
99
+ "44553956",
100
+ "14904274",
101
+ "21083950",
102
+ "1613231",
103
+ "43490211",
104
+ "43955312"
105
+ ],
106
+ "61277994": [
107
+ "44355159",
108
+ "44432202",
109
+ "957307"
110
+ ]
111
+ }
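Because the re-ranking task only reorders each query's candidate list, every gold document for a training query should already appear among that query's 30 pre-ranked candidates. A quick sanity check along these lines can confirm it (file names taken from the README; this is a sketch, not part of the provided scripts):

```python
import json

with open("datasets/train_gold_mapping.json") as f:
    gold = json.load(f)
with open("datasets/shuffled_pre_ranking.json") as f:
    pre_ranking = json.load(f)

# Report any gold document that is missing from its query's candidate pool.
for qid, gold_docs in gold.items():
    candidates = set(pre_ranking.get(qid, []))
    missing = [d for d in gold_docs if d not in candidates]
    if missing:
        print(f"query {qid}: gold docs not in pre-ranking: {missing}")
print("check complete")
```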
datasets/train_queries.json ADDED
@@ -0,0 +1,22 @@
+ [
+ "79098180",
+ "79045164",
+ "106232152",
+ "106318129",
+ "80155730",
+ "105235513",
+ "66336898",
+ "79741091",
+ "76353708",
+ "85176564",
+ "77017157",
+ "76109734",
+ "61277994",
+ "78090091",
+ "74598812",
+ "1864211",
+ "79314580",
+ "86860100",
+ "78286001",
+ "78061231"
+ ]
evaluate_train_rankings.py ADDED
@@ -0,0 +1,101 @@
+ import os
+ import json
+ import argparse
+ import sys
+
+ # Import metrics directly from the local file
+ from metrics import mean_recall_at_k, mean_average_precision, mean_inv_ranking, mean_ranking
+
+ def load_json_file(file_path):
+     """Load JSON data from a file"""
+     with open(file_path, 'r') as f:
+         return json.load(f)
+
+ def main():
+     parser = argparse.ArgumentParser(description='Evaluate document ranking performance on training data')
+     parser.add_argument('--pre_ranking', type=str, default='shuffled_pre_ranking.json',
+                         help='Path to pre-ranking JSON file')
+     parser.add_argument('--re_ranking', type=str, default='predictions2.json',
+                         help='Path to re-ranked JSON file')
+     parser.add_argument('--gold', type=str, default='train_gold_mapping.json',
+                         help='Path to gold standard mapping JSON file (training only)')
+     parser.add_argument('--train_queries', type=str, default='train_queries.json',
+                         help='Path to training queries JSON file')
+     parser.add_argument('--k_values', type=str, default='3,5,10,20',
+                         help='Comma-separated list of k values for Recall@k')
+     parser.add_argument('--base_dir', type=str,
+                         default='datasets',
+                         help='Base directory for data files')
+     args = parser.parse_args()
+
+     # Ensure all paths are relative to base_dir if they're not absolute
+     def get_full_path(path):
+         if os.path.isabs(path):
+             return path
+         return os.path.join(args.base_dir, path)
+
+     # Load the training queries
+     print("Loading training queries...")
+     train_queries = load_json_file(get_full_path(args.train_queries))
+     print(f"Loaded {len(train_queries)} training queries")
+
+     # Load the ranking data and gold standard
+     print("Loading ranking data and gold standard...")
+     pre_ranking = load_json_file(get_full_path(args.pre_ranking))
+     re_ranking = load_json_file(get_full_path(args.re_ranking))
+     gold_mapping = load_json_file(get_full_path(args.gold))
+
+     # Filter to include only training queries
+     pre_ranking = {fan: docs for fan, docs in pre_ranking.items() if fan in train_queries}
+     re_ranking = {fan: docs for fan, docs in re_ranking.items() if fan in train_queries}
+     gold_mapping = {fan: docs for fan, docs in gold_mapping.items() if fan in train_queries}
+
+     # Parse k values
+     k_values = [int(k) for k in args.k_values.split(',')]
+
+     # Prepare data for metrics calculation
+     query_fans = set(gold_mapping.keys()) & set(pre_ranking.keys()) & set(re_ranking.keys())
+
+     if not query_fans:
+         print("Error: No common query FANs found across all datasets!")
+         return
+
+     print(f"Evaluating rankings for {len(query_fans)} training queries...")
+
+     # Extract true and predicted labels for both rankings
+     true_labels = [gold_mapping[fan] for fan in query_fans]
+     pre_ranking_labels = [pre_ranking[fan] for fan in query_fans]
+     re_ranking_labels = [re_ranking[fan] for fan in query_fans]
+
+     # Calculate metrics for pre-ranking
+     print("\nPre-ranking performance (training queries only):")
+     for k in k_values:
+         recall_at_k = mean_recall_at_k(true_labels, pre_ranking_labels, k=k)
+         print(f" Recall@{k}: {recall_at_k:.4f}")
+
+     map_score = mean_average_precision(true_labels, pre_ranking_labels)
+     print(f" MAP: {map_score:.4f}")
+
+     inv_rank = mean_inv_ranking(true_labels, pre_ranking_labels)
+     print(f" Mean Inverse Rank: {inv_rank:.4f}")
+
+     rank = mean_ranking(true_labels, pre_ranking_labels)
+     print(f" Mean Rank: {rank:.2f}")
+
+     # Calculate metrics for re-ranking
+     print("\nRe-ranking performance (training queries only):")
+     for k in k_values:
+         recall_at_k = mean_recall_at_k(true_labels, re_ranking_labels, k=k)
+         print(f" Recall@{k}: {recall_at_k:.4f}")
+
+     map_score = mean_average_precision(true_labels, re_ranking_labels)
+     print(f" MAP: {map_score:.4f}")
+
+     inv_rank = mean_inv_ranking(true_labels, re_ranking_labels)
+     print(f" Mean Inverse Rank: {inv_rank:.4f}")
+
+     rank = mean_ranking(true_labels, re_ranking_labels)
+     print(f" Mean Rank: {rank:.2f}")
+
+ if __name__ == "__main__":
+     main()
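The evaluator above expects the re-ranking to be a JSON object with the same shape as the pre-ranking: each training query ID mapped to its candidate documents, now in the order your model prefers. By default it reads `predictions2.json` from `--base_dir` (default `datasets`), so a typical run is `python evaluate_train_rankings.py --re_ranking my_rerank.json`. As an illustrative baseline (not part of the provided scripts), the sketch below writes an "identity" re-ranking that simply copies the pre-ranking order for the training queries; the output file name is hypothetical:

```python
import json
import os

base_dir = "datasets"  # matches the script's default --base_dir

with open(os.path.join(base_dir, "shuffled_pre_ranking.json")) as f:
    pre_ranking = json.load(f)
with open(os.path.join(base_dir, "train_queries.json")) as f:
    train_queries = json.load(f)

# Identity baseline: keep the shuffled order unchanged for each training query.
identity_ranking = {qid: pre_ranking[qid] for qid in train_queries}

with open(os.path.join(base_dir, "identity_ranking.json"), "w") as f:
    json.dump(identity_ranking, f, indent=2)

# Evaluate with: python evaluate_train_rankings.py --re_ranking identity_ranking.json
```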
metrics.py ADDED
@@ -0,0 +1,207 @@
+ """
+ Evaluation metrics for document ranking.
+ This file contains implementation of various evaluation metrics
+ for assessing the quality of document rankings.
+ """
+ import numpy as np
+
+ def recall_at_k(true_items, predicted_items, k=10):
+     """
+     Calculate recall at k for a single query.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+         k (int): Number of top items to consider
+
+     Returns:
+         float: Recall@k value between 0 and 1
+     """
+     if not true_items:
+         return 0.0  # No relevant items to recall
+
+     # Get the top k predicted items
+     top_k_items = predicted_items[:k]
+
+     # Count the number of true items in the top k predictions
+     relevant_in_top_k = sum(1 for item in top_k_items if item in true_items)
+
+     # Calculate recall: (relevant items in top k) / (total relevant items)
+     return relevant_in_top_k / len(true_items)
+
+ def mean_recall_at_k(true_items_list, predicted_items_list, k=10):
+     """
+     Calculate mean recall at k across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+         k (int): Number of top items to consider
+
+     Returns:
+         float: Mean Recall@k value between 0 and 1
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return 0.0  # No data provided
+
+     # Calculate recall@k for each query
+     recalls = [recall_at_k(true_items, predicted_items, k)
+                for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Return mean recall@k
+     return sum(recalls) / len(recalls)
+
+ def average_precision(true_items, predicted_items):
+     """
+     Calculate average precision for a single query.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+
+     Returns:
+         float: Average precision value between 0 and 1
+     """
+     if not true_items or not predicted_items:
+         return 0.0
+
+     # Track number of relevant items seen and running sum of precision values
+     relevant_count = 0
+     precision_sum = 0.0
+
+     # Calculate precision at each position where a relevant item is found
+     for i, item in enumerate(predicted_items):
+         position = i + 1  # 1-indexed position
+
+         if item in true_items:
+             relevant_count += 1
+             # Precision at this position = relevant items seen / position
+             precision_at_position = relevant_count / position
+             precision_sum += precision_at_position
+
+     # Average precision = sum of precision values / total relevant items
+     total_relevant = len(true_items)
+     return precision_sum / total_relevant if total_relevant > 0 else 0.0
+
+ def mean_average_precision(true_items_list, predicted_items_list):
+     """
+     Calculate mean average precision (MAP) across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+
+     Returns:
+         float: MAP value between 0 and 1
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return 0.0  # No data provided
+
+     # Calculate average precision for each query
+     aps = [average_precision(true_items, predicted_items)
+            for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Return mean average precision
+     return sum(aps) / len(aps)
+
+ def inverse_ranking(true_items, predicted_items):
+     """
+     Calculate inverse ranking for the first relevant item.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+
+     Returns:
+         float: Inverse ranking value between 0 and 1
+     """
+     if not true_items or not predicted_items:
+         return 0.0
+
+     # Find position of first relevant item (1-indexed)
+     for i, item in enumerate(predicted_items):
+         if item in true_items:
+             rank = i + 1
+             return 1.0 / rank  # Inverse ranking
+
+     # No relevant items found in predictions
+     return 0.0
+
+ def mean_inv_ranking(true_items_list, predicted_items_list):
+     """
+     Calculate mean inverse ranking (MIR) across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+
+     Returns:
+         float: MIR value between 0 and 1
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return 0.0  # No data provided
+
+     # Calculate inverse ranking for each query
+     inv_ranks = [inverse_ranking(true_items, predicted_items)
+                  for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Return mean inverse ranking
+     return sum(inv_ranks) / len(inv_ranks)
+
+ def ranking(true_items, predicted_items):
+     """
+     Calculate the rank of the first relevant item.
+
+     Parameters:
+         true_items (list): List of true relevant items
+         predicted_items (list): List of predicted items (ranked)
+
+     Returns:
+         float: Rank of the first relevant item (1-indexed)
+     """
+     if not true_items or not predicted_items:
+         return float('inf')  # No relevant items to find
+
+     # Find position of first relevant item (1-indexed)
+     for i, item in enumerate(predicted_items):
+         if item in true_items:
+             return i + 1  # Return rank (1-indexed)
+
+     # No relevant items found in predictions
+     return float('inf')
+
+ def mean_ranking(true_items_list, predicted_items_list):
+     """
+     Calculate mean ranking across multiple queries.
+
+     Parameters:
+         true_items_list (list of lists): List of true relevant items for each query
+         predicted_items_list (list of lists): List of predicted items for each query
+
+     Returns:
+         float: Mean ranking value (higher is worse)
+     """
+     if len(true_items_list) != len(predicted_items_list):
+         raise ValueError("Number of true item lists must match number of predicted item lists")
+
+     if not true_items_list:
+         return float('inf')  # No data provided
+
+     # Calculate ranking for each query
+     ranks = [ranking(true_items, predicted_items)
+              for true_items, predicted_items in zip(true_items_list, predicted_items_list)]
+
+     # Filter out 'inf' values for mean calculation
+     finite_ranks = [r for r in ranks if r != float('inf')]
+
+     # Return mean ranking
+     return sum(finite_ranks) / len(finite_ranks) if finite_ranks else float('inf')
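The metric functions above operate on plain Python lists, so they are easy to test in isolation. A small worked example, with hand-checked expected values in the comments:

```python
from metrics import (mean_recall_at_k, mean_average_precision,
                     mean_inv_ranking, mean_ranking)

# Two toy queries with their gold documents and predicted rankings.
true_items_list = [["d1", "d3"], ["d9"]]
predicted_items_list = [["d3", "d2", "d1"], ["d7", "d8", "d9"]]

print(mean_recall_at_k(true_items_list, predicted_items_list, k=2))   # 0.25
print(mean_average_precision(true_items_list, predicted_items_list))  # ~0.5833
print(mean_inv_ranking(true_items_list, predicted_items_list))        # ~0.6667
print(mean_ranking(true_items_list, predicted_items_list))            # 2.0
```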
requirements.txt ADDED
@@ -0,0 +1,73 @@
+ appnope==0.1.4
+ asttokens==3.0.0
+ certifi==2025.1.31
+ charset-normalizer==3.4.1
+ comm==0.2.2
+ contourpy==1.3.1
+ cycler==0.12.1
+ debugpy==1.8.13
+ decorator==5.2.1
+ executing==2.2.0
+ filelock==3.18.0
+ fonttools==4.57.0
+ fsspec==2025.3.2
+ huggingface-hub==0.30.1
+ idna==3.10
+ ipykernel==6.29.5
+ ipython==9.0.2
+ ipython_pygments_lexers==1.1.1
+ jedi==0.19.2
+ Jinja2==3.1.6
+ joblib==1.4.2
+ jupyter_client==8.6.3
+ jupyter_core==5.7.2
+ kiwisolver==1.4.8
+ llvmlite==0.44.0
+ MarkupSafe==3.0.2
+ matplotlib==3.10.1
+ matplotlib-inline==0.1.7
+ mpmath==1.3.0
+ nest-asyncio==1.6.0
+ networkx==3.4.2
+ numba==0.61.0
+ numpy==2.1.0
+ packaging==24.2
+ pandas==2.2.3
+ parso==0.8.4
+ pexpect==4.9.0
+ pillow==11.1.0
+ platformdirs==4.3.7
+ prompt_toolkit==3.0.50
+ psutil==7.0.0
+ ptyprocess==0.7.0
+ pure_eval==0.2.3
+ Pygments==2.19.1
+ pynndescent==0.5.13
+ pyparsing==3.2.3
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ PyYAML==6.0.2
+ pyzmq==26.4.0
+ regex==2024.11.6
+ requests==2.32.3
+ safetensors==0.5.3
+ scikit-learn==1.6.1
+ scipy==1.15.2
+ seaborn==0.13.2
+ sentence-transformers==4.0.2
+ setuptools==78.1.0
+ six==1.17.0
+ stack-data==0.6.3
+ sympy==1.13.1
+ threadpoolctl==3.6.0
+ tokenizers==0.21.1
+ torch==2.6.0
+ tornado==6.4.2
+ tqdm==4.67.1
+ traitlets==5.14.3
+ transformers==4.50.3
+ typing_extensions==4.13.1
+ tzdata==2025.2
+ umap-learn==0.5.7
+ urllib3==2.3.0
+ wcwidth==0.2.13
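These pinned versions can be installed into a fresh environment with `pip install -r requirements.txt`; on Google Colab, the environment suggested in the README, many of them are already present, so installing only what is missing (for example `sentence-transformers`) is often enough.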