mgyigit committed
Commit 99abba4 · verified · 1 Parent(s): 2d07a94

Update src/about.py

Files changed (1)
  1. src/about.py +14 -2
src/about.py CHANGED
@@ -38,6 +38,10 @@ LLM_BENCHMARKS_TEXT = f"""
  1. Semantic Similarity Inference:
  - This benchmark evaluates how well protein representation models can infer functional similarities between proteins. Ground truth functional similarities are derived from Gene Ontology (GO) annotations.
  - Different distance metrics (Cosine, Manhattan, Euclidean) are used to compute protein vector similarities, which are then correlated with the functional similarities.
+ - The benchmark uses three different datasets:
+ • Sparse: A sparse uniform dataset with broader protein coverage
+ • 200: A set of 200 well-annotated proteins
+ • 500: A set of 500 well-annotated proteins
 
  - Metrics (sim_ prefix):
  • sim_sparse_MF_correlation/sim_200_MF_correlation/sim_500_MF_correlation: Correlation between protein embeddings and Molecular Function (MF) similarity scores
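
As a rough illustration of the semantic similarity evaluation described in this hunk, the sketch below computes embedding-space similarities with the three distance metrics and rank-correlates them with GO-derived similarities. It is not the leaderboard's actual code: `embeddings`, `pairs`, and `go_similarity` are hypothetical stand-ins for the benchmark data, and Spearman correlation is assumed.

```python
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical stand-ins: one embedding per protein, a list of protein pairs,
# and a GO-derived functional similarity score (e.g. for the MF aspect) per pair.
embeddings = {f"P{i:05d}": rng.normal(size=128) for i in range(100)}
pairs = [(f"P{i:05d}", f"P{j:05d}") for i in range(100) for j in range(i + 1, 100)][:500]
go_similarity = rng.uniform(size=len(pairs))

for name, dist in [("cosine", cosine), ("manhattan", cityblock), ("euclidean", euclidean)]:
    # Negate each distance so that larger values mean "more similar",
    # then rank-correlate with the GO-based similarities.
    emb_similarity = [-dist(embeddings[a], embeddings[b]) for a, b in pairs]
    rho, pvalue = spearmanr(emb_similarity, go_similarity)
    print(f"{name}: correlation={rho:.3f}, pvalue={pvalue:.3g}")
```
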
@@ -47,8 +51,9 @@ LLM_BENCHMARKS_TEXT = f"""
  • sim_sparse_*_pvalue/sim_200_*_pvalue/sim_500_*_pvalue: Statistical significance (p-value) of the respective correlations
 
  2. Ontology-Based Protein Function Prediction (PFP):
- - This benchmark assesses the ability of representation models to predict ontology-based functional annotations (GO terms). The models are tested on how well they classify proteins based on molecular functions (MF), biological processes (BP), and cellular components (CC).
+ - This benchmark assesses the ability of representation models to predict ontology-based functional annotations (GO terms). The models are tested on how well they classify proteins based on molecular functions, biological processes, and cellular components.
  - A linear classifier is used to ensure that the models themselves are responsible for good performance, rather than the complexity of the classifier.
+ - The evaluation uses 5-fold cross-validation to ensure robust assessment.
 
  - Metrics (func_ prefix):
  • func_BP_accuracy/func_CC_accuracy/func_MF_accuracy: Accuracy of predicting protein function in respective GO aspects
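
The PFP setup described above (a linear classifier on frozen embeddings, evaluated with 5-fold cross-validation) can be sketched with scikit-learn as follows. This is illustrative only: `X` and `y` are hypothetical placeholders for protein embeddings and binary GO-term labels, and logistic regression is just one possible choice of linear classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)

# Hypothetical placeholders: one embedding per protein and a binary label
# indicating whether the protein is annotated with a given GO term.
X = rng.normal(size=(500, 128))
y = rng.integers(0, 2, size=500)

clf = LogisticRegression(max_iter=1000)  # a simple linear classifier
scores = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "f1"])
print("accuracy (5-fold mean):", scores["test_accuracy"].mean())
print("f1 (5-fold mean):", scores["test_f1"].mean())
```
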
@@ -59,7 +64,12 @@ LLM_BENCHMARKS_TEXT = f"""
 
  3. Drug Target Protein Family Classification:
  - This benchmark focuses on predicting the family of drug target proteins (enzymes, receptors, ion channels, etc.). This task tests the ability of models to learn structural features critical to these classifications.
- - The study evaluates models using datasets with varying sequence similarity thresholds (random, 50%, 30%, 15%) to ensure the models can predict beyond simple sequence similarity.
+ - The benchmark uses datasets with varying sequence similarity thresholds to test generalization:
+ • nc: Non-clustered/Random split
+ • uc50: UniClust50 (50% sequence similarity-based split)
+ • uc30: UniClust30 (30% sequence similarity-based split)
+ • mm15: Maximum 15% sequence identity between train and test sets (MMseqs2-based split)
+
  - Metrics (fam_ prefix):
  • fam_nc_*_ave: Metrics for family classification on the non-clustered (nc) dataset
  • fam_uc50_*_ave: Metrics for family classification on the UniRef50 clustered (uc50) dataset
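
Below is a hypothetical, single-run sketch of the family classification evaluation across the four splits named above. The randomly generated `splits` stand in for the real clustered datasets, and the leaderboard's `_ave` suffix presumably denotes averaging over repetitions, which is omitted here for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Hypothetical train/test splits standing in for the nc/uc50/uc30/mm15 datasets;
# labels 0-3 stand in for the drug target protein families.
splits = {}
for name in ["nc", "uc50", "uc30", "mm15"]:
    X_tr, y_tr = rng.normal(size=(300, 128)), rng.integers(0, 4, size=300)
    X_te, y_te = rng.normal(size=(100, 128)), rng.integers(0, 4, size=100)
    splits[name] = (X_tr, y_tr, X_te, y_te)

for name, (X_tr, y_tr, X_te, y_te) in splits.items():
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    print(f"fam_{name}: accuracy={accuracy_score(y_te, pred):.3f}, "
          f"f1={f1_score(y_te, pred, average='weighted'):.3f}")
```
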
@@ -69,6 +79,8 @@ LLM_BENCHMARKS_TEXT = f"""
  4. Protein–Protein Binding Affinity Estimation:
  - This benchmark evaluates models' ability to predict the change in binding affinities between proteins due to mutations. The dataset used is the **SKEMPI** dataset, which contains experimentally determined binding affinities.
  - The task measures how well models can extract critical structural features important for protein-protein interactions.
+ - Evaluation is performed through 10-fold cross-validation.
+
  - Metrics (aff_ prefix):
  • aff_mse_ave: Mean Squared Error for binding affinity prediction
  • aff_mae_ave: Mean Absolute Error for binding affinity prediction
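
The binding affinity task described above (regression scored with MSE and MAE under 10-fold cross-validation) can be sketched as follows; `X`, `y`, and the ridge regressor are illustrative assumptions rather than the benchmark's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)

# Hypothetical placeholders: features for a mutated protein-protein complex
# and the experimentally measured change in binding affinity.
X = rng.normal(size=(400, 256))
y = rng.normal(size=400)

scores = cross_validate(Ridge(), X, y, cv=10,
                        scoring=["neg_mean_squared_error", "neg_mean_absolute_error"])
print("MSE (10-fold mean):", -scores["test_neg_mean_squared_error"].mean())
print("MAE (10-fold mean):", -scores["test_neg_mean_absolute_error"].mean())
```
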
 