Replaced files with up-to-date version from BenchName
- .gitattributes +0 -0
- README.md +3 -14
- app.py +9 -1
- requirements.txt +0 -0
- src/__init__.py +0 -0
- src/content.py +14 -14
- src/evaluation/__init__.py +0 -0
- src/evaluation/base_task_metrics.py +0 -0
- src/evaluation/commit_message_generation/__init__.py +0 -0
- src/evaluation/commit_message_generation/cmg_metrics.py +0 -0
- src/evaluation/metrics.py +0 -0
- src/formatting.py +0 -0
- src/get_results_for_task.py +15 -11
- src/leaderboard_formatting.py +37 -15
- src/submission_uploader.py +2 -2
- src/tasks_content.py +42 -27
- src/utils.py +0 -0
.gitattributes
CHANGED
File without changes
README.md
CHANGED
@@ -1,8 +1,8 @@
 ---
-title: …
-emoji: …
+title: BenchName
+emoji: 🏟
 colorFrom: yellow
-colorTo: …
+colorTo: red
 sdk: gradio
 sdk_version: 4.36.1
 app_file: app.py
@@ -10,14 +10,3 @@ pinned: false
 ---
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-
-## Citing
-```
-@article{bogomolov2024long,
-  title={Long Code Arena: a Set of Benchmarks for Long-Context Code Models},
-  author={Bogomolov, Egor and Eliseeva, Aleksandra and Galimzyanov, Timur and Glukhov, Evgeniy and Shapkin, Anton and Tigina, Maria and Golubev, Yaroslav and Kovrigin, Alexander and van Deursen, Arie and Izadi, Maliheh and Bryksin, Timofey},
-  journal={arXiv preprint arXiv:2406.11612},
-  year={2024}
-}
-```
-You can find the paper [here](https://arxiv.org/abs/2406.11612).
app.py
CHANGED
@@ -32,7 +32,7 @@ logging.basicConfig(
 )
 
 submission_uploader = SubmissionUploader(
-    dataset_id=os.environ["DATASET_ID"], private_dataset_id=os.…
+    dataset_id=os.environ["DATASET_ID"], private_dataset_id=os.getenv("PRIVATE_DATASET_ID")
 )
 
 
@@ -58,6 +58,14 @@ def get_leaderboard_for_completion_task(dataset_name: str | None):
 )
 
 
+def get_aggregated_leaderboard_for_task(task_pretty: str) -> gr.components.Dataframe:
+    return gr.components.Dataframe(
+        value=get_results_for_task(task_pretty),
+        interactive=False,
+        datatype=get_types_per_task(TASKS_PRETTY_REVERSE[task_pretty]),
+    )
+
+
 with gr.Blocks() as demo:
     # intro
     gr.HTML(INTRODUCTION_TITLE)
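The new `get_aggregated_leaderboard_for_task` helper builds a read-only table for the aggregated tab; the wiring into the `demo` layout is outside this hunk. Below is a minimal, self-contained sketch of how such a helper can be mounted inside `gr.Blocks`; the toy DataFrame, column set, and tab label are illustrative assumptions, not the Space's actual code.

```python
# Minimal sketch (assumption, not the Space's code): a read-only leaderboard tab.
import gradio as gr
import pandas as pd


def build_aggregated_leaderboard() -> gr.Dataframe:
    # Toy stand-in for the DataFrame returned by get_results_for_task(...).
    toy = pd.DataFrame(
        {
            "Model Name": ['<a target="_blank" href="https://example.com">Example Model</a>'],
            "Mean Rank": [1.0],
            "Mean Score": [0.78],
        }
    )
    # datatype lists one type per column, in order; "html" renders the link.
    return gr.Dataframe(value=toy, interactive=False, datatype=["html", "number", "number"])


with gr.Blocks() as demo:
    with gr.Tab("Aggregated Results"):
        build_aggregated_leaderboard()

if __name__ == "__main__":
    demo.launch()
```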
requirements.txt
CHANGED
File without changes
src/__init__.py
CHANGED
File without changes
src/content.py
CHANGED
@@ -3,22 +3,22 @@ from .formatting import styled_warning
 # ================================
 # = ABOUT =
 # ================================
-INTRODUCTION_TITLE = """<h1 align="center">🏟️ …
+INTRODUCTION_TITLE = """<h1 align="center">🏟️ BenchName </h1>"""
 
-INTRODUCTION_TEXT = """🏟️ **…
+INTRODUCTION_TEXT = """🏟️ **BenchName** is a suite of benchmarks for code-related tasks with large contexts, up to a whole code repository.
 It currently spans six different tasks and contains six datasets:
 
-* 🤗 [Library-based code generation](https://huggingface.co/datasets/…
-* 🤗 [CI builds repair](https://huggingface.co/datasets/…
-* 🤗 [Project-level code completion](https://huggingface.co/datasets/…
-* 🤗 [Commit message generation](https://huggingface.co/datasets/…
-* 🤗 [Bug localization](https://huggingface.co/datasets/…
-* 🤗 [Module summarization](https://huggingface.co/datasets/…
+* 🤗 [Library-based code generation](https://huggingface.co/datasets/icmlbenchname/library-based-code-generation)
+* 🤗 [CI builds repair](https://huggingface.co/datasets/icmlbenchname/ci-builds-repair)
+* 🤗 [Project-level code completion](https://huggingface.co/datasets/icmlbenchname/project-level-code-completion)
+* 🤗 [Commit message generation](https://huggingface.co/datasets/icmlbenchname/commit-message-generation)
+* 🤗 [Bug localization](https://huggingface.co/datasets/icmlbenchname/bug-localization)
+* 🤗 [Module summarization](https://huggingface.co/datasets/icmlbenchname/module-summarization)
 
-We are excited to invite you to participate in solving our benchmarks! To submit your results, please send the following materials to our 📩 email (…
+We are excited to invite you to participate in solving our benchmarks! To submit your results, please send the following materials to our 📩 email (icmlbenchname@gmail.com):
 
 * **Results**: Include the summary of your benchmark outcomes.
-* **Reproduction Package**: To ensure the integrity and reproducibility of your results, please include the code for context collection (if any), generation of predictions, and evaluating. You can follow […
+* **Reproduction Package**: To ensure the integrity and reproducibility of your results, please include the code for context collection (if any), generation of predictions, and evaluating. You can follow [baselines](https://anonymous.4open.science/r/icml-benchname-2025/README.md) as a reference.
 * **Metadata**: Model information, organization name, licence of your model, context size, and other information you find relevant.
 
 We look forward to reviewing your innovative solutions!
@@ -30,23 +30,23 @@ We look forward to reviewing your innovative solutions!
 # ================================
 LEADERBOARD_TITLE = '<h2 align="center">🏆 Leaderboard</h2>'
 
-LEADERBOARD_TEXT = """The raw results from the leaderboard are available in 🤗 […
+LEADERBOARD_TEXT = """The raw results from the leaderboard are available in 🤗 [icmlbenchname/results](https://huggingface.co/datasets/icmlbenchname/results)."""
 
 # ================================
 # = SUBMISSION =
 # ================================
 SUBMISSION_TITLE = '<h2 align="center">📩 Make A Submission</h2>'
 
-SUBMISSION_TEXT_INTRO = """Use the form below to submit new results to 🏟️ …
+SUBMISSION_TEXT_INTRO = """Use the form below to submit new results to 🏟️ BenchName. If any problems arise, don't hesitate to contact us by email `TODO` or open a discussion"""
 
 SUBMISSION_TEXT_TASK = """1. Select a task you want to submit results for."""
 
 SUBMISSION_TEXT_METADATA = """2. Fill in some metadata about your submission."""
 
 SUBMISSION_TEXT_FILES = """3. Attach one or more files with your model's predictions.
-* If several files are attached, they will be treated as separate runs of the submitted model (e.g., with different seeds), and the metrics will be averaged across runs. For baselines provided by 🏟️ …
+* If several files are attached, they will be treated as separate runs of the submitted model (e.g., with different seeds), and the metrics will be averaged across runs. For baselines provided by 🏟️ BenchName Team, the results are averaged across 3 runs.
 """
 
-SUBMISSION_TEXT_SUBMIT = """All set! A new PR to 🤗 […
+SUBMISSION_TEXT_SUBMIT = """All set! A new PR to 🤗 [icmlbenchname/results](https://huggingface.co/datasets/icmlbenchname/results) should be opened when you press "Submit" button. 🏟️ BenchName Team will review it shortly, and the results will appear in the leaderboard.
 
 ⏳ **Note:** It might take some time (up to 40 minutes) for PR to get created, since it involves computing metrics for your submission."""
src/evaluation/__init__.py
CHANGED
File without changes
src/evaluation/base_task_metrics.py
CHANGED
File without changes
src/evaluation/commit_message_generation/__init__.py
CHANGED
File without changes
src/evaluation/commit_message_generation/cmg_metrics.py
CHANGED
File without changes
src/evaluation/metrics.py
CHANGED
File without changes
src/formatting.py
CHANGED
File without changes
src/get_results_for_task.py
CHANGED
@@ -37,7 +37,7 @@ def _get_results_stub() -> pd.DataFrame:
         "ChrF": "X",
         "BERTScore": "X",
         "BERTScore (Normalized)": "X",
-        "Submitted By": "…
+        "Submitted By": "BenchName Team",
         "Resources": "",
     },
     {
@@ -49,7 +49,7 @@ def _get_results_stub() -> pd.DataFrame:
         "ChrF": "X",
         "BERTScore": "X",
         "BERTScore (Normalized)": "X",
-        "Submitted By": "…
+        "Submitted By": "BenchName Team",
         "Resources": "",
     },
 ]
@@ -77,27 +77,31 @@ def _get_results_dataset(task_id: str) -> pd.DataFrame:
         os.environ["DATASET_ID"], task_id, split="test", download_mode="force_redownload"
     ).to_pandas()
     results_df = results_df.rename(columns=COLUMNS_PRETTY, errors="ignore")
-    results_df["Context Size"] = results_df["Context Size"].map(lambda x: f"{int(x) // 1000}k" if int(x) >= 1000 else x)
-    …
 
+    if task_id != "aggregated":
+        results_df["Context Size"] = results_df["Context Size"].map(lambda x: f"{int(x) // 1000}k" if int(x) >= 1000 else x)
+        results_df["Resources"] = [_process_urls(urls) for urls in results_df["Resources"]]
+        results_df = results_df.sort_values(by=SORT_COLUMN_PER_TASK[task_id], ascending=False)
 
     for metric_column in METRICS_PER_TASK[task_id]:
         if "BERTScore" in metric_column:
             results_df[metric_column] = results_df[metric_column].map(lambda x: f"{x:.5f}")
+        elif "Mean Rank" in metric_column:
+            continue
         else:
             results_df[metric_column] = results_df[metric_column].map(lambda x: f"{x:.2f}")
 
-    …
+    if task_id == 'aggregated':
+        results_df["Model Name"] = results_df["Model"]
+    else:
+        results_df["Model Name"] = [
+            model_hyperlink(link=link, model_name=model_name) if link else model_name
+            for link, model_name in zip(results_df["model_url"], results_df["Model Name"])
+        ]
     if task_id == 'project_code_completion':
         results_df["Dataset Name"] = [_extract_dataset_name(urls) for urls in results_df["Dataset"]]
         results_df["Dataset"] = [_process_urls(urls) for urls in results_df["Dataset"]]
-        results_df["Resources"] = [_process_urls(urls) for urls in results_df["Resources"]]
     results_df = results_df[get_columns_per_task(task_id)]
-    if task_id == 'ci_builds_repair':
-        results_df = results_df.rename(columns={"Context Size": "Context"})
     return results_df
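The updated `_get_results_dataset` calls `model_hyperlink(...)`, which is defined elsewhere and not shown in this diff. A typical leaderboard helper of that name (a hedged sketch, not this repository's verbatim implementation) wraps the model name in an HTML anchor so that columns typed as `"html"` render a clickable link:

```python
# Hypothetical sketch of a model_hyperlink helper (not taken from this commit).
def model_hyperlink(link: str, model_name: str) -> str:
    # Wrap the model name in an anchor tag; the leaderboard column is typed "html".
    return f'<a target="_blank" href="{link}" style="text-decoration: underline">{model_name}</a>'


if __name__ == "__main__":
    print(model_hyperlink("https://huggingface.co/models", "Example Model"))
```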
src/leaderboard_formatting.py
CHANGED
@@ -20,16 +20,26 @@ COLUMNS_PRETTY = {
     "EM commited": "EM committed",
     "EM non_informative": "EM non-informative",
     "EM random": "EM random",
-    "EM all": "EM all",
+    "EM all": "EM all",
+    "context_composer": "Context Composer",
+    "context_length": "Context Size",
     "dataset": "Dataset",
     "CompScore": "CompScore",
     "context": "Context",
     "task_type": "Task type",
-    "date": "Date, mm/yy",
 }
 
 # Add your metrics
 METRICS_PER_TASK = {
+    "aggregated": [
+        "Mean Rank",
+        "Mean Score",
+        "Library-based CG",
+        "CI builds repair",
+        "CMG",
+        "Bug localization",
+        "Module summarization",
+    ],
     "commit_message_generation": [
         "BLEU",
         "ChrF",
@@ -49,18 +59,28 @@ METRICS_PER_TASK = {
         "EM all",
     ],
     "bug_localization": [
-        …
+        "P",
+        "R",
+        "FPR",
+        "F1-score",
+        "All_correct",
+        "All_incorrect",
+        "Output_count",
     ],
     "module_summarization": [
         "CompScore",
     ],
     "library_based_code_generation": [
-        …
-        "API Recall",
+        "API Recall\nno context",
+        "API Recall\n20 APIs",
+        "API Recall\n200 APIs",
+        "API Recall\n2,000 APIs",
+        "API Recall\nall APIs",
+        "ChrF\nno context",
+        "ChrF\n20 APIs",
+        "ChrF\n200 APIs",
+        "ChrF\n2,000 APIs",
+        "ChrF\nall APIs",
     ],
     "ci_builds_repair": [
         "Pass@1",
@@ -73,15 +93,17 @@ SORT_COLUMN_PER_TASK = {
     "project_code_completion": "EM inproject",
     "bug_localization": "Model Name",
     "module_summarization": "CompScore",
-    "library_based_code_generation": "API Recall",
+    "library_based_code_generation": "API Recall\nall APIs",
     "ci_builds_repair": "Pass@1",
 }
 
 
 def get_columns_per_task(task_id: str) -> List[str]:
     metrics_per_task = METRICS_PER_TASK[task_id]
+    if task_id == 'aggregated':
+        return ["Model Name"] + metrics_per_task
     if task_id == 'project_code_completion':
-        return ["Model Name", "Context Size", "Dataset Name", "Dataset"] + metrics_per_task + ["…
+        return ["Model Name", "Context Composer", "Context Size", "Dataset Name", "Dataset"] + metrics_per_task + ["Submitted By", "Resources"]
     if task_id == 'bug_localization':
         return ["Model Name", "Availability", "Context Size"] + metrics_per_task + ["Submitted By", "Resources"]
@@ -89,10 +111,10 @@ def get_columns_per_task(task_id: str) -> List[str]:
         return ["Model Name", "Context Size"] + metrics_per_task + ["Submitted By", "Resources"]
 
     if task_id == 'library_based_code_generation':
-        return ["Model Name"…
+        return ["Model Name"] + metrics_per_task + ["Availability", "Submitted By", "Resources"]
 
     if task_id == 'ci_builds_repair':
-        return ["Model Name", "Context Size", "Task type"] + metrics_per_task + ["…
+        return ["Model Name", "Context Size", "Task type"] + metrics_per_task + ["Availability", "Submitted By", "Resources"]
 
     return ["Model Name", "Context Size", "Availability"] + metrics_per_task + ["Submitted By", "Resources"]
@@ -100,9 +122,9 @@ def get_types_per_task(task_id: str) -> List[str]:
 def get_types_per_task(task_id: str) -> List[str]:
     metrics_per_task = METRICS_PER_TASK.get(task_id, (0, 0, 0, 0, 0))
     if task_id == 'project_code_completion':
-        return ["html", "markdown", "markdown", "html"] + ["number" for _ in metrics_per_task] + ["markdown", "…
+        return ["html", "markdown", "markdown", "markdown", "html"] + ["number" for _ in metrics_per_task] + ["markdown", "html"]
     if task_id == 'bug_localization':
         return ["html", "markdown", "markdown"] + ["number" for _ in metrics_per_task] + ["markdown", "html"]
     if task_id == 'ci_builds_repair':
-        return ["html", "markdown", "markdown"] + ["number" for _ in metrics_per_task] + ["markdown", "markdown", "…
+        return ["html", "markdown", "markdown"] + ["number" for _ in metrics_per_task] + ["markdown", "markdown", "html"]
     return ["html", "markdown", "markdown"] + ["number" for _ in metrics_per_task] + ["markdown", "html"]
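`get_columns_per_task` and `get_types_per_task` have to stay in sync: the Gradio `Dataframe` expects one datatype per column, in the same order, so every column added to a task (such as the new `Context Composer` column for code completion) needs a matching entry in the types list. A small self-contained check of that invariant, using simplified placeholder lists rather than the module's real data:

```python
# Illustration only: leaderboard columns and Gradio datatypes must align one-to-one.
metrics = ["BLEU", "ChrF", "BERTScore"]
columns = ["Model Name", "Context Size", "Availability"] + metrics + ["Submitted By", "Resources"]
types = ["html", "markdown", "markdown"] + ["number" for _ in metrics] + ["markdown", "html"]

assert len(columns) == len(types), "each leaderboard column needs exactly one datatype"
for column, dtype in zip(columns, types):
    print(f"{column!r} -> {dtype}")
```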
src/submission_uploader.py
CHANGED
@@ -31,8 +31,8 @@ class SubmissionUploader:
     """
 
     def __init__(self, dataset_id: str, private_dataset_id: str):
-        self._api = HfApi(token=os.…
-        self._fs = HfFileSystem(token=os.…
+        self._api = HfApi(token=os.getenv("HF_TOKEN"))
+        self._fs = HfFileSystem(token=os.getenv("HF_TOKEN"))
         self._results_dataset_id = dataset_id
         self._requests_dataset_id = private_dataset_id
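The old token lookups are truncated after `os.` in this view; the updated code reads the token with `os.getenv("HF_TOKEN")`, which returns `None` when the variable is unset instead of failing the way a bare `os.environ["HF_TOKEN"]` lookup would, so the uploader can still be constructed without the token defined. A standalone illustration of that difference:

```python
# Standalone illustration of os.getenv vs. os.environ for a missing variable.
import os

os.environ.pop("HF_TOKEN", None)  # simulate the variable being unset

print(os.getenv("HF_TOKEN"))  # prints: None
try:
    os.environ["HF_TOKEN"]
except KeyError:
    print("os.environ['HF_TOKEN'] raises KeyError when the variable is missing")
```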
src/tasks_content.py
CHANGED
@@ -1,6 +1,7 @@
 from typing import Optional
 
 TASKS_PRETTY = {
+    "aggregated": "Aggregated Results",
     "library_based_code_generation": "Library-based code generation",
     "ci_builds_repair": "CI builds repair",
     "project_code_completion": "Project-level code completion",
@@ -11,24 +12,40 @@ TASKS_PRETTY = {
 TASKS_PRETTY_REVERSE = {value: key for key, value in TASKS_PRETTY.items()}
 
 TASKS_DESCRIPTIONS = {
+    "aggregated": """# Aggregated Results\n
+
+Here, we present the aggregated results across all the tasks in BenchName (except for Project-level code completion, where its specifics required a different selection of models). To get more details about each task, visit the corresponding tab.
+
+To obtain aggregated results, we first select only one metric from metric suite for each task:
+* Library-based code generation: `API Recall`
+* CI builds repair: `Pass@1`
+* Commit message generation: `chrF`
+* Bug localization: `F1-score`
+* Module summarization: `CompScore`
+
+Then, to ensure a fair comparison across tasks with different score ranges, we normalize all scores to a 0-1 scale, where zero corresponds to the worst-performing model, and 1 to the best one. Note that for mean rank, rather than using strict rankings, we implemented a ranking system with a 10% margin to account for models with similar performance.
+
+We report mean rank (with std) and mean score across the tasks from BenchName, and the scores for each task in the table below.
+""",
     "library_based_code_generation": """# Library-based code generation\n
 
-Our Library-based code generation benchmark 🤗 […
+Our Library-based code generation benchmark 🤗 [icmlbenchname/library-based-code-generation](https://huggingface.co/datasets/icmlbenchname/library-based-code-generation) includes 150 manually curated instructions asking a model to generate Python code using a particular library. Samples come from 62 Python repositories. All the samples in the dataset are based on reference example programs written by authors of the respective libraries.
 
 For evaluation, we use two metrics:
 * `ChrF`: textual similarity between the generated code and the reference program.
 * `API Recall`: share of library-specific API calls used in the reference program that appear in the generated code,
 
-…
+As a context, we pass a prefix of the list of APIs available in the target library.
+We select the prefix based on their BM-25 similarity with the provided instruction.
+
+For further details on the dataset and the baselines from the BenchName team, refer to the `library_based_code_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "ci_builds_repair": """# CI builds repair\n
 
-Our CI builds repair benchmark 🤗 […
+Our CI builds repair benchmark 🤗 [icmlbenchname/ci-builds-repair](https://huggingface.co/datasets/icmlbenchname/ci-builds-repair)
 includes 77 manually curated and assessed data points coming from 32 Python repositories, which are used to make a model fix a failed build.
 
 The benchmark clones the repo to the local directory, the model fixes the issue according to logs and the local repo state,
@@ -40,16 +57,14 @@ TASKS_DESCRIPTIONS = {
 * `oracle: files` — ground truth diffs are used to select files that should be corrected to fix the issue;
 * `oracle: files, lines` — ground truth diffs are used to select files and code blocks that should be corrected to fix the issue;
 
-For further details on the dataset and the baselines from the …
-If you have any questions or requests concerning this dataset, please contact us at [email protected].
+For further details on the dataset and the baselines from the BenchName team, refer to the `ci-builds-repair` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "project_code_completion": """# Project-level code completion\n
 
-Our Project-level code completion benchmark 🤗 […
+Our Project-level code completion benchmark 🤗 [icmlbenchname/project-level-code-completion](https://huggingface.co/datasets/icmlbenchname/project-level-code-completion) includes four sets of samples:
 * `small-context`: 144 data points,
 * `medium-context`: 224 data points,
 * `large-context`: 270 data points,
@@ -67,16 +82,14 @@ TASKS_DESCRIPTIONS = {
 * *non-informative* — short/long lines, import/print lines, or comment lines;
 * *random* — lines that don't fit any of the previous categories.
 
-For further details on the dataset and the baselines from the …
-If you have any questions or requests concerning this dataset, please contact us at [email protected].
+For further details on the dataset and the baselines from the BenchName team, refer to the `project_level_code_completion` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "commit_message_generation": """# Commit message generation\n
 
-Our Commit message generation benchmark 🤗 […
+Our Commit message generation benchmark 🤗 [icmlbenchname/commit-message-generation](https://huggingface.co/datasets/icmlbenchname/commit-message-generation) includes 163 manually curated commits with large diffs from 34 Python projects, which the model needs to generate commit messages for.
 
 We use the following metrics for evaluation:
 * [BLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu)
@@ -84,39 +97,41 @@ TASKS_DESCRIPTIONS = {
 * [ChrF](https://huggingface.co/spaces/evaluate-metric/chrf)
 * [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
 
-For further details on the dataset and the baselines from the …
+For further details on the dataset and the baselines from the BenchName team, refer to the `commit_message_generation` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Note.** The leaderboard is sorted by the `ROUGE-1` metric by default.
 
-If you have any questions or requests concerning this dataset, please contact us at [email protected].
-
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 
 """,
 
     "bug_localization": """# Bug localization\n
 
-Our Bug localization benchmark 🤗 […
+Our Bug localization benchmark 🤗 [icmlbenchname/bug-localization](https://huggingface.co/datasets/icmlbenchname/bug-localization) includes 150 manually verified bug issue descriptions with information about pull request that fix them for Python, Java, and Kotlin projects.
 The model needs to identify the files within the repository that need to be modified to address the reported bug.
-…
+
+To evaluate baseline performance, we use the following classification metrics:
+* **P** - precision to estimate how many of the predicted buggy files were correctly identified
+* **R** - recall to indicate how many of the actual buggy files were correctly found
+* **FPR** - false positive rate to indicate how many non-buggy files were incorrectly predicted as buggy
+* **F1-score** - score to provide a balance between precision and recall
+* **All correct** - percentage of cases where all buggy files were correctly identified
+* **All incorrect** - percentage of cases where all buggy files were incorrectly identified
+* **# Output** - average number of buggy files detected, to further assess performance, particularly concerning high **FPR**.
 
-For further details on the dataset and the baselines from the …
-If you have any questions or requests concerning this dataset, please contact us at [email protected].
+For further details on the dataset and the baselines from the BenchName team, refer to the `bug_localization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
 
     "module_summarization": """# Module summarization\n
-Our Module summarization benchmark 🤗 […
+Our Module summarization benchmark 🤗 [icmlbenchname/module-summarization](https://huggingface.co/datasets/icmlbenchname/module-summarization) includes 216 manually curated text files describing different documentation of open-source permissive Python projects.
 The model is required to generate such description, given the relevant context code and the intent behind the documentation.
 
 We use a novel metric for evaluation:
-* `CompScore`: the new metric based on LLM as an assessor proposed for this task. Our approach involves feeding the LLM with relevant code and two versions of documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://…
+* `CompScore`: the new metric based on LLM as an assessor proposed for this task. Our approach involves feeding the LLM with relevant code and two versions of documentation: the ground truth and the model-generated text. More details on how it is calculated can be found in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/module_summarization/README.md).
 
-For further details on the dataset and the baselines from the …
-If you have any questions or requests concerning this dataset, please contact us at [email protected].
+For further details on the dataset and the baselines from the BenchName team, refer to the `module_summarization` directory in [our baselines repository](https://anonymous.4open.science/r/icml-benchname-2025/).
 
 **Terms of use**. As this dataset is collected from GitHub, researchers may use it for research purposes only if any publications resulting from that research are open access (see [GitHub Acceptable Use Policies](https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies#7-information-usage-restrictions)).
 """,
@@ -130,6 +145,6 @@ def get_submission_text_files_for_task(task_pretty: Optional[str]) -> str:
     task_id = TASKS_PRETTY_REVERSE[task_pretty]
 
     if task_id == "commit_message_generation":
-        return f"""**{task_pretty} Instructions:**\n\n* Please, attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by …
+        return f"""**{task_pretty} Instructions:**\n\n* Please, attach files in [JSONLines format](https://jsonlines.org/). For an example, check the predictions provided by BenchName Team in 🤗 [icmlbenchname/results](https://huggingface.co/datasets/icmlbenchname/results/tree/main/commit_message_generation/predictions). Make sure to include `"prediction"` and `"reference"` fields for each example, the rest are optional."""
 
     return f"**{task_pretty} Instructions:**\n\n* 🚧 There are no instructions for the current task yet."
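The "Aggregated Results" description added above explains the aggregation in prose: pick one metric per task, min-max normalize each task's scores to the 0-1 range, then report the mean score and a mean rank with a 10% margin for near-ties. A toy, self-contained sketch of that computation follows; the numbers are invented, and the margin is approximated by bucketing normalized scores to one decimal before ranking, which is an assumption rather than the team's exact procedure.

```python
# Toy sketch of the aggregation described in the "Aggregated Results" tab text.
import pandas as pd

# One selected metric per task; the values below are invented for illustration.
raw = pd.DataFrame(
    {
        "CI builds repair": [0.20, 0.35, 0.10],      # Pass@1
        "CMG": [30.0, 42.0, 25.0],                   # chrF
        "Module summarization": [1.10, 1.45, 0.90],  # CompScore
    },
    index=["model-a", "model-b", "model-c"],
)

# Min-max normalization per task: the worst model maps to 0, the best to 1.
normalized = (raw - raw.min()) / (raw.max() - raw.min())

# Mean score across tasks, plus a rough stand-in for the 10%-margin ranking:
# normalized scores are bucketed to one decimal so near-ties share a rank.
summary = pd.DataFrame(
    {
        "Mean Score": normalized.mean(axis=1),
        "Mean Rank": normalized.round(1).rank(ascending=False).mean(axis=1),
    }
).sort_values("Mean Rank")

print(summary)
```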
src/utils.py
CHANGED
File without changes