mgyigit committed · verified
Commit 2d07a94 · Parent(s): 2d69cc8

Update src/about.py

Files changed (1): src/about.py (+26 -3)

src/about.py CHANGED
@@ -38,18 +38,41 @@ LLM_BENCHMARKS_TEXT = f"""
 1. Semantic Similarity Inference:
 - This benchmark evaluates how well protein representation models can infer functional similarities between proteins. Ground truth functional similarities are derived from Gene Ontology (GO) annotations.
 - Different distance metrics (Cosine, Manhattan, Euclidean) are used to compute protein vector similarities, which are then correlated with the functional similarities.
+
+ - Metrics (sim_ prefix):
+ • sim_sparse_MF_correlation/sim_200_MF_correlation/sim_500_MF_correlation: Correlation between protein embeddings and Molecular Function (MF) similarity scores
+ • sim_sparse_BP_correlation/sim_200_BP_correlation/sim_500_BP_correlation: Correlation between protein embeddings and Biological Process (BP) similarity scores
+ • sim_sparse_CC_correlation/sim_200_CC_correlation/sim_500_CC_correlation: Correlation between protein embeddings and Cellular Component (CC) similarity scores
+ • sim_sparse_Ave_correlation/sim_200_Ave_correlation/sim_500_Ave_correlation: Average correlation across the MF, BP, and CC aspects
+ • sim_sparse_*_pvalue/sim_200_*_pvalue/sim_500_*_pvalue: Statistical significance (p-value) of the respective correlations

 2. Ontology-Based Protein Function Prediction (PFP):
- - This benchmark assesses the ability of representation models to predict ontology-based functional annotations (GO terms). The models are tested on how well they classify proteins based on molecular functions, biological processes, and cellular components.
- - A linear classifier is used to ensure that the models themselves are responsible for good performance, rather than the complexity of the classifier.
+ - This benchmark assesses the ability of representation models to predict ontology-based functional annotations (GO terms). The models are tested on how well they classify proteins based on molecular functions (MF), biological processes (BP), and cellular components (CC).
+ - A linear classifier is used to ensure that the models themselves are responsible for good performance, rather than the complexity of the classifier.
+
+ - Metrics (func_ prefix):
+ • func_BP_accuracy/func_CC_accuracy/func_MF_accuracy: Accuracy of predicting protein function in the respective GO aspects
+ • func_BP_F1/func_CC_F1/func_MF_F1: F1 score for protein function prediction
+ • func_BP_precision/func_CC_precision/func_MF_precision: Precision of function prediction
+ • func_BP_recall/func_CC_recall/func_MF_recall: Recall of function prediction
+ • func_Ave_accuracy/func_Ave_F1/func_Ave_precision/func_Ave_recall: Average metrics across the BP, CC, and MF aspects

 3. Drug Target Protein Family Classification:
 - This benchmark focuses on predicting the family of drug target proteins (enzymes, receptors, ion channels, etc.). This task tests the ability of models to learn structural features critical to these classifications.
 - The study evaluates models using datasets with varying sequence similarity thresholds (random, 50%, 30%, 15%) to ensure the models can predict beyond simple sequence similarity.
+ - Metrics (fam_ prefix):
+ • fam_nc_*_ave: Metrics for family classification on the non-clustered (nc) dataset
+ • fam_uc50_*_ave: Metrics for family classification on the UniRef50-clustered (uc50) dataset
+ • fam_uc30_*_ave: Metrics for family classification on the UniRef30-clustered (uc30) dataset
+ • fam_mm15_*_ave: Metrics for family classification on the maximum 15% sequence identity (mm15) dataset

 4. Protein–Protein Binding Affinity Estimation:
 - This benchmark evaluates models' ability to predict the change in binding affinities between proteins due to mutations. The dataset used is the **SKEMPI** dataset, which contains experimentally determined binding affinities.
 - The task measures how well models can extract critical structural features important for protein-protein interactions.
+ - Metrics (aff_ prefix):
+ • aff_mse_ave: Mean Squared Error for binding affinity prediction
+ • aff_mae_ave: Mean Absolute Error for binding affinity prediction
+ • aff_corr_ave: Correlation between predicted and actual binding affinity values

 ### PROBE is part of the study entitled [Learning functional properties of proteins with language models](https://rdcu.be/cJAKN), which is schematically summarized in the figure below:<br/>
 """
@@ -84,7 +107,7 @@ Welcome to the PROBE (Protein RepresentatiOn BEnchmark) leaderboard! This platfo
 - **Protein Family**: Classifying drug target families.
 - **Protein Affinity**: Estimating binding affinities.

-Submit your own representation models and compare their performance across these tasks. For more details on how to participate, see the submission guidelines.
+Submit your own representation models and compare their performance across these tasks. For more details on how to participate, see the submission guidelines in the Submit Here! tab. For descriptions of each benchmark and its metrics, please refer to the About tab.

 If you find PROBE useful, please consider citing our work:
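The semantic similarity task described in the diff correlates embedding-space similarities with GO-derived functional similarities. A minimal sketch of that kind of evaluation, using toy embeddings and a fabricated GO similarity matrix (this is not PROBE's actual code; all names and data here are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(x, y):
    # Spearman rank correlation, computed as the Pearson correlation of ranks
    # (assumes no ties, which holds for the random floats below).
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 64))   # toy protein embeddings
go_sim = rng.uniform(size=(10, 10))      # fabricated GO-based similarity matrix
go_sim = (go_sim + go_sim.T) / 2         # symmetrize

# Correlate embedding similarity with GO similarity over all protein pairs.
pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]
emb_sims = np.array([cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs])
go_sims = np.array([go_sim[i, j] for i, j in pairs])
print(round(spearman(emb_sims, go_sims), 3))
```

The real benchmark reports this correlation separately per GO aspect (MF, BP, CC) and per dataset variant (sparse, 200, 500), along with p-values.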
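The PFP task reports accuracy, F1, precision, and recall per GO aspect. How those four numbers relate can be sketched for a single binary GO-term label with plain Python (labels fabricated for illustration; PROBE's own pipeline trains a linear classifier on the embeddings first):

```python
def prf1(y_true, y_pred):
    # Precision, recall, and F1 for binary labels (1 = term annotated).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy ground-truth and predicted annotations for one GO term.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = prf1(y_true, y_pred)
print(accuracy, precision, recall, f1)  # → 0.75 0.75 0.75 0.75
```

The func_Ave_* columns then average these per-aspect scores across BP, CC, and MF.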
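For the binding affinity task, the three aff_ metrics (aff_mse_ave, aff_mae_ave, aff_corr_ave) are standard regression measures. A sketch with fabricated ΔΔG values (kcal/mol; the helper name `affinity_metrics` is hypothetical, not part of PROBE):

```python
import numpy as np

def affinity_metrics(y_true, y_pred):
    # MSE, MAE, and Pearson correlation between measured and predicted values.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = float(np.mean((y_true - y_pred) ** 2))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    corr = float(np.corrcoef(y_true, y_pred)[0, 1])
    return mse, mae, corr

# Fabricated binding-affinity changes for five mutations.
measured = [0.5, -1.2, 2.0, 0.0, 1.1]
predicted = [0.4, -0.9, 1.6, 0.3, 1.0]
mse, mae, corr = affinity_metrics(measured, predicted)
print(round(mse, 3), round(mae, 3), round(corr, 3))
```

MSE penalizes large errors quadratically, MAE weighs all errors equally, and the correlation checks whether the model ranks mutations correctly even if its scale is off.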