VenusFactory Training Module User Guide
1. Introduction
The VenusFactory Training Module is a powerful tool that allows you to train custom models using protein sequence data. These models can predict various protein properties such as subcellular localization, function, stability, and more. The training module provides an intuitive interface that enables biological researchers to train high-performance protein prediction models without programming knowledge.
2. Supported Protein Language Models
VenusFactory supports various advanced protein language models. You can choose the appropriate model based on your task requirements and computational resources.
Model Name | Model Parameter Size | Number of Models | Model Example |
---|---|---|---|
ESM2 | 8M/35M/150M/650M/3B/15B | 6 | facebook/esm2_t33_650M_UR50D |
ESM-1b | 650M | 1 | facebook/esm1b_t33_650M_UR50S |
ESM-1v | 650M | 5 | facebook/esm1v_t33_650M_UR90S_1 |
ProtBert-Uniref100 | 420M | 1 | Rostlab/prot_bert |
ProtBert-BFD100 | 420M | 1 | Rostlab/prot_bert_bfd |
IgBert | 420M | 1 | Exscientia/IgBert |
IgBert_unpaired | 420M | 1 | Exscientia/IgBert_unpaired |
ProtT5-Uniref50 | 3B/11B | 2 | Rostlab/prot_t5_xl_uniref50 |
ProtT5-BFD100 | 3B/11B | 2 | Rostlab/prot_t5_xl_bfd |
IgT5 | 3B | 1 | Exscientia/IgT5 |
IgT5_unpaired | 3B | 1 | Exscientia/IgT5_unpaired |
Ankh | 450M/1.2B | 2 | ElnaggarLab/ankh-base |
ProSST | 110M | 7 | AI4Protein/ProSST-2048 |
ProPrime | 690M | 1 | AI4Protein/Prime_690M |
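Any of these checkpoints can also be loaded outside VenusFactory for a quick sanity check. The snippet below is a minimal sketch using the Hugging Face transformers library (independent of the VenusFactory interface) that embeds a single sequence with ESM2:

```python
# Minimal sketch: load a checkpoint from the table above and embed one sequence.
from transformers import AutoTokenizer, AutoModel

model_id = "facebook/esm2_t33_650M_UR50D"  # any model ID from the table
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(seq, return_tensors="pt")
hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden_dim)
print(hidden.shape)
```

Note that the T5-style checkpoints (ProtT5, IgT5) are encoder-decoder models and are typically used through their encoder only.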
3. Supported Fine-tuning Methods
VenusFactory provides multiple training methods, each with specific advantages and applicable scenarios.
Fine-tuning Method | Description | Data Type |
---|---|---|
Freeze | Freezes the pre-trained model, training only the classifier | Sequence information |
Full | Full parameter fine-tuning, training all parameters | Sequence information |
LoRA | Uses Low-Rank Adaptation method to reduce parameter count | Sequence information |
DoRA | Uses Weight-Decomposed Low-Rank Adaptation method | Sequence information |
AdaLoRA | Uses Adaptive Low-Rank Adaptation method | Sequence information |
IA3 | Uses Infused Adapter by Inhibiting and Amplifying Inner Activations method | Sequence information |
QLoRA | Uses Quantized Low-Rank Adaptation method to reduce memory requirements | Sequence information |
SES-Adapter | Uses Structure-Enhanced Sequence Adapter, integrating sequence and structure information | Sequence & Structure information |
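To make the simplest option concrete, here is a minimal PyTorch sketch (hypothetical code, not VenusFactory's internal implementation) of the Freeze method: the backbone's weights stay fixed and only a small classifier head is trained.

```python
# Sketch of the Freeze strategy: fix the backbone, train only a classifier head.
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
for param in backbone.parameters():
    param.requires_grad = False  # the pre-trained model is not updated

num_labels = 2  # hypothetical value; set to your task's label count
classifier = nn.Linear(backbone.config.hidden_size, num_labels)
# During training, only `classifier.parameters()` are given to the optimizer.
```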
4. Supported Evaluation Metrics
VenusFactory provides multiple evaluation metrics to assess model performance.
Abbreviation | Metric Name | Applicable Problem Types | Description | Optimization Direction |
---|---|---|---|---|
Accuracy | Accuracy | Single-label/Multi-label classification | Proportion of correctly predicted samples, suitable for balanced datasets | Higher is better |
Recall | Recall | Single-label/Multi-label classification | Proportion of correctly identified positive classes, focuses on reducing false negatives | Higher is better |
Precision | Precision | Single-label/Multi-label classification | Proportion of correctly predicted positive classes, focuses on reducing false positives | Higher is better |
F1 | F1 Score | Single-label/Multi-label classification | Harmonic mean of precision and recall, suitable for imbalanced datasets | Higher is better |
MCC | Matthews Correlation Coefficient | Single-label/Multi-label classification | Metric that considers all confusion matrix elements, fairer for imbalanced datasets | Higher is better |
AUROC | Area Under ROC Curve | Single-label/Multi-label classification | Evaluates classification performance at different thresholds | Higher is better |
F1_max | Maximum F1 Score | Multi-label classification | Maximum F1 value at different thresholds, suitable for multi-label classification | Higher is better |
Spearman_corr | Spearman Correlation Coefficient | Regression | Evaluates the monotonic relationship between predicted and true values, range [-1,1] | Higher is better |
MSE | Mean Squared Error | Regression | Evaluates prediction error of regression models | Lower is better |
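For reference, these metrics correspond to standard implementations in scikit-learn and SciPy. The snippet below is a small sketch (assuming binary single-label predictions plus a separate regression task) showing how each would be computed outside the interface:

```python
# Sketch: the listed metrics via scikit-learn / SciPy.
from scipy.stats import spearmanr
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, matthews_corrcoef, roc_auc_score,
                             mean_squared_error)

y_true = [0, 1, 1, 0]           # example ground-truth classes
y_pred = [0, 1, 0, 0]           # example predicted classes
y_score = [0.1, 0.9, 0.4, 0.2]  # example predicted probabilities

print(accuracy_score(y_true, y_pred))     # Accuracy
print(recall_score(y_true, y_pred))       # Recall
print(precision_score(y_true, y_pred))    # Precision
print(f1_score(y_true, y_pred))           # F1
print(matthews_corrcoef(y_true, y_pred))  # MCC
print(roc_auc_score(y_true, y_score))     # AUROC

y_reg, y_hat = [0.75, -1.2, 3.45], [0.8, -1.0, 3.2]  # example regression values
print(mean_squared_error(y_reg, y_hat))              # MSE (lower is better)
print(spearmanr(y_reg, y_hat).correlation)           # Spearman_corr
```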
5. Training Interface Details
The training interface is divided into several main sections, each containing specific configuration options.
5.1 Model and Dataset Configuration
Protein Language Model Selection
- Protein Language Model: Select a pre-trained model from the dropdown menu
- Consider your computational resources and task complexity when selecting
- Larger models require more computational resources
Dataset Selection
- Dataset Selection: Choose the dataset source
  - Use Pre-defined Dataset: Use system-defined datasets
    - Dataset Configuration: Select a dataset from the dropdown menu
    - The system will automatically load the problem type, number of labels, and evaluation metrics
  - Use Custom Dataset: Use a custom dataset
    - Custom Dataset Path: Enter the Hugging Face dataset path (format: `username/dataset_name`)
    - Problem Type: Select the problem type
      - `single_label_classification`: Single-label classification
      - `multi_label_classification`: Multi-label classification
      - `regression`: Regression
    - Number of Labels: Set the number of labels (for classification problems)
    - Metrics: Select evaluation metrics (multiple selections allowed)
      - `accuracy`: Accuracy
      - `f1`: F1 Score
      - `precision`: Precision
      - `recall`: Recall
      - `mcc`: Matthews Correlation Coefficient
      - `auroc`: Area Under the ROC Curve
      - `f1max`: Maximum F1 Score
      - `spearman_corr`: Spearman Correlation Coefficient
      - `mse`: Mean Squared Error

For more information, refer to 4. Supported Evaluation Metrics.
Dataset Preview
- Preview Dataset: Click this button to preview the selected dataset
- Displays dataset statistics: number of samples in training, validation, and test sets
- Displays dataset examples: including sequences and labels
5.2 Training Method Configuration
- Training Method: Select the training method
  - `freeze`: Freeze the pre-trained model, train only the classifier
  - `full`: Full parameter fine-tuning, train all parameters
  - `plm-lora`: Use LoRA (Low-Rank Adaptation) to reduce the trainable parameter count
  - `dora`: Use DoRA (Weight-Decomposed Low-Rank Adaptation)
  - `adalora`: Use AdaLoRA (Adaptive Low-Rank Adaptation)
  - `ia3`: Use IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
  - `plm-qlora`: Use QLoRA (Quantized Low-Rank Adaptation) to reduce memory requirements
  - `ses-adapter`: Use the Structure-Enhanced Sequence Adapter, integrating sequence and structure information

  For more information, refer to 3. Supported Fine-tuning Methods.
- Pooling Method: Select the pooling method
  - `mean`: Mean pooling
  - `attention1d`: Attention pooling
  - `light_attention`: Lightweight attention pooling
- Structure Sequence (visible when `ses-adapter` is selected):
  - Select structure sequence types (multiple selections allowed); the default is `foldseek_seq` and `ss8_seq`
- LoRA Parameters (visible when `plm-lora` or `plm-qlora` is selected; a code sketch follows this list):
  - LoRA Rank: The rank of LoRA, default is 8; affects parameter count and performance
  - LoRA Alpha: The alpha value of LoRA, default is 32; affects the scaling factor
  - LoRA Dropout: The dropout rate of LoRA, default is 0.1; affects regularization
  - LoRA Target Modules: Target modules for LoRA application, default is `query,key,value`
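For readers curious what these defaults look like in code, below is a minimal sketch using the Hugging Face peft library (illustrative only; VenusFactory applies LoRA through its own training scripts):

```python
# Sketch: the LoRA defaults above expressed as a peft configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base_model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
lora_config = LoraConfig(
    r=8,                                       # LoRA Rank
    lora_alpha=32,                             # LoRA Alpha
    lora_dropout=0.1,                          # LoRA Dropout
    target_modules=["query", "key", "value"],  # LoRA Target Modules
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints the trainable-parameter percentage
```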
5.3 Batch Processing Configuration
- Batch Processing Mode: Select the batch processing mode
  - Batch Size Mode: Fixed batch size
    - Batch Size: Set the number of samples per batch, default is 16
  - Batch Token Mode: Fixed token count (see the sketch after this list)
    - Tokens per Batch: Set the number of tokens per batch, default is 10000
    - Suitable for datasets with large variations in sequence length
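To illustrate what Batch Token Mode does, the function below is a hypothetical sketch (not VenusFactory's actual batching code) that groups sequences so each batch stays under a fixed token budget:

```python
# Hypothetical sketch of token-budget batching: each yielded batch holds at
# most `tokens_per_batch` residues in total, so long sequences form small
# batches and short sequences form large ones.
def batch_by_tokens(sequences, tokens_per_batch=10000):
    batch, token_count = [], 0
    for seq in sorted(sequences, key=len):  # grouping by length reduces padding
        if batch and token_count + len(seq) > tokens_per_batch:
            yield batch
            batch, token_count = [], 0
        batch.append(seq)
        token_count += len(seq)
    if batch:
        yield batch
```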
5.4 Training Parameters
The following parameters control optimization; a sketch mapping them onto a standard training loop follows this list.

- Learning Rate: default is 5e-4
  - Controls the step size of model updates; values that are too large may prevent convergence, values that are too small slow down training
- Number of Epochs: number of training epochs, default is 100
  - Number of complete passes through the dataset
  - Actual training may end earlier due to early stopping
- Early Stopping Patience: patience N, default is 10
  - Training stops early if validation performance does not improve for N consecutive epochs
- Max Sequence Length: default is None (-1 indicates no limit)
  - Maximum protein sequence length to process
- Scheduler Type: learning rate scheduler type
  - `linear`: Linear decay
  - `cosine`: Cosine decay
  - `step`: Step decay
  - `None`: No scheduler
- Warmup Steps: default is 0
  - Number of steps over which the learning rate gradually increases from a small value to the set value
  - Helps stabilize early training
- Gradient Accumulation Steps: default is 1
  - Accumulates gradients over multiple batches before updating the model
  - Simulates a larger effective batch size
- Max Gradient Norm: gradient clipping threshold, default is -1 (no clipping)
  - Limits the maximum gradient norm to prevent gradient explosion
  - Recommended range: 1.0 to 5.0
- Number of Workers: number of data-loading worker threads, default is 4
  - Affects data loading speed; adjust based on CPU core count
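The sketch below shows roughly how these fields combine in a standard PyTorch loop (a hedged illustration; `model` and `dataloader` are assumed to exist, and VenusFactory wires this up internally):

```python
# Sketch: mapping the training parameters above onto a PyTorch loop.
# Assumes `model` and `dataloader` are already defined.
import torch
from torch.optim import AdamW
from transformers import get_scheduler

learning_rate = 5e-4  # Learning Rate
num_epochs = 100      # Number of Epochs
grad_accum_steps = 1  # Gradient Accumulation Steps
max_grad_norm = 1.0   # Max Gradient Norm (clipping enabled in this sketch)

optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = num_epochs * len(dataloader) // grad_accum_steps
scheduler = get_scheduler("linear", optimizer,   # Scheduler Type
                          num_warmup_steps=0,    # Warmup Steps
                          num_training_steps=total_steps)

for epoch in range(num_epochs):
    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss / grad_accum_steps
        loss.backward()
        if (step + 1) % grad_accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```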
5.5 Output and Logging Settings
- Save Directory: default is `ckpt`
  - Path where the model and training results are saved
- Output Model Name: default is `model.pt`
  - Filename of the saved model
- Enable W&B Logging: whether to enable Weights & Biases logging
  - When checked, you can set the W&B project name and entity
  - Used for experiment tracking and visualization
5.6 Training Control and Output
- Preview Command: Preview the training command to be executed
  - Click to display the complete command-line arguments
- Abort: Abort the current training process
- Start: Start the training process
- Model Statistics: Display model parameter statistics
  - Parameter counts for the training model, pre-trained model, and combined model
  - Percentage of trainable parameters
- Training Progress: Display training progress
  - Current phase (training, validation, testing)
  - Progress percentage
  - Time elapsed and estimated time remaining
  - Current loss value and gradient steps
- Best Performance: Display best-model information
  - Best epoch and its corresponding evaluation metrics
- Training and Validation Loss: Loss curve graph
  - Training loss and validation loss over time
- Validation Metrics: Validation-set metrics graph
  - Each selected evaluation metric over time
- Test Results: Display the test results
  - Final performance metrics on the test set
  - Evaluation metrics can be downloaded in CSV format
6. Training Process Guide
Below is a complete guide to using the VenusFactory training module, from data preparation to model evaluation.
6.1 Preparing the Dataset
Using Pre-defined Datasets
- Select "Use Pre-defined Dataset" in Dataset Selection
- Choose a dataset from the Dataset Configuration dropdown menu
- Click the Preview Dataset button to view dataset statistics and examples
Using Custom Datasets
- Prepare a dataset that meets the requirements and upload it to Hugging Face (see 7. Custom Dataset Format Requirements)
- Select "Use Custom Dataset" in Dataset Selection
- Enter the Hugging Face dataset path in Custom Dataset Path (format: `username/dataset_name`)
- Set Problem Type, Number of Labels, and Metrics
- Click the Preview Dataset button to verify that the dataset loads correctly (a quick Python check is sketched below)
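If you want to verify the path before entering it, the dataset can be loaded directly with the Hugging Face datasets library (`username/dataset_name` below is a placeholder):

```python
# Quick sanity check that the dataset path resolves and has the expected splits.
from datasets import load_dataset

ds = load_dataset("username/dataset_name")  # placeholder path
print(ds)              # should list train / validation / test splits
print(ds["train"][0])  # each sample should contain `aa_seq` and `label`
```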
6.2 Selecting a Model and Training Method
- Choose a pre-trained model from the Protein Language Model dropdown menu
- Select an appropriate Training Method
- Choose a Pooling Method
- If selecting `ses-adapter`, ensure you specify structure sequence types in Structure Sequence
- If selecting `plm-lora` or `plm-qlora`, adjust the LoRA parameters as needed
6.3 Configuring Batch Processing and Training Parameters
- Select Batch Processing Mode
  - Use Batch Size Mode when sequence lengths are similar
  - Use Batch Token Mode when sequence lengths vary significantly
- Set the batch size or token count
  - Adjust based on GPU memory; reduce if out-of-memory errors occur
- Set Learning Rate
- Set Number of Epochs
  - Rely on the early stopping mechanism; set Early Stopping Patience to 10-20 to prevent overfitting
- Set Max Sequence Length
- Adjust advanced parameters as needed
  - Scheduler Type: `linear` or `cosine` is recommended
  - Warmup Steps: 5-10% of total steps is recommended
  - Gradient Accumulation Steps: increase if GPU memory is insufficient
  - Max Gradient Norm: set to 1.0-5.0 if training is unstable
6.4 Setting Output and Logging
- Set Save Directory as the path to save the model
- Set Output Model Name as the model filename
- If you need to track training, check Enable W&B Logging and set project information
6.5 Starting Training
- Click Preview Command to preview the training command
- Click the Start button to begin training
- Observe training progress and metric changes
- After training is complete, view the test results
  - Check the evaluation metrics
  - Download results in CSV format if needed
- To stop training at any point, click the Abort button
7. Custom Dataset Format Requirements
To use a custom dataset, you need to upload the dataset to the Hugging Face platform and ensure it meets the following format requirements.
7.1 Basic Requirements
- The dataset must include `train`, `validation`, and `test` subsets
- Each sample must contain the following fields:
  - `aa_seq`: Amino acid sequence, using standard single-letter codes
  - `label`: Label; format depends on the problem type
7.2 Label Formats for Different Problem Types
Single-label Classification (single_label_classification)
- `label`: Integer value representing the class index (starting from 0)
- Examples: 0, 1, 2, ...
CSV format example:
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,1
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,0
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,2
Multi-label Classification (multi_label_classification)
- `label`: Comma-separated string of class indices for the classes that are present
- Example: "373,449,584,674,780,883,897,911,1048,1073,1130,1234"
CSV format example:
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,"373,449,584,674,780,883,897,911,1048,1073,1130,1234"
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,"15,42,87,103,256"
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,"7,98,120,256,512,789"
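When preprocessing such labels yourself, the comma-separated string is typically expanded into a binary indicator vector. The helper below is a hypothetical sketch (`to_multi_hot` is not part of VenusFactory):

```python
# Hypothetical helper: expand a comma-separated multi-label string into a
# binary indicator vector of length `num_labels`.
def to_multi_hot(label_str, num_labels):
    vec = [0] * num_labels
    for idx in label_str.split(","):
        vec[int(idx)] = 1
    return vec

print(to_multi_hot("15,42,87,103,256", num_labels=300))  # 1s at the listed indices
```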
Regression (regression)
- `label`: Floating-point number representing a continuous value
- Examples: 0.75, -1.2, ...
CSV format example:
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,0.75
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,-1.2
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,3.45
7.3 Structural Information (Optional)
If using the `ses-adapter` training method, you can add the following structural information fields:
- `foldseek_seq`: FoldSeek structure sequence, using single-letter codes to represent structural elements
- `ss8_seq`: 8-class secondary structure sequence, using single-letter codes to represent secondary structures
CSV format example:
name,aa_seq,foldseek_seq,ss8_seq,label
Q9LSD8,MPEEDLVELKFRLYDGSDVGPFQYSPTATVSMLKERIVSEWPKDKKIVPKSASDIKLINAGKILENGKTVAQCKAPFDDLPKSVITMHVVVQLSPTKARPEKKIEKEEAPQRSFCSCTIM,DPPQLWAFAWEAEPVRDIDDRDTDHQQQFLLVVLQVCLVRPDPPDPDHAPHSVQKWKDDPNDTGDRNDGNNRRDDPPDDDSPDHHYIYIDGRDPPVVPPVPPPPPPPPPPPPPPPPPPPD,LLLLLLEEEEEELTTSLEEEEEEELTTLBHHHHHHHHHHTLLTTLSSLLSSGGGEEEEETTEELLTTLBHHHHLLLLLLLTTLLEEEEEEELLLLLLLLLLLLLLLLLLLLLLLLLLLLL,0
7.4 Uploading to Hugging Face
Create separate CSV files for the training, validation, and test sets:
- `train.csv`: Training data
- `validation.csv`: Validation data
- `test.csv`: Test data

Upload the dataset to Hugging Face.
After uploading, use the dataset's `username/dataset_name` path as the Custom Dataset Path in VenusFactory.
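The upload can also be scripted. The snippet below is a minimal sketch using the Hugging Face datasets library (it assumes you have authenticated with `huggingface-cli login`; `username/dataset_name` is a placeholder):

```python
# Sketch: build a dataset from the three CSV files and push it to the Hub.
from datasets import load_dataset

ds = load_dataset("csv", data_files={
    "train": "train.csv",
    "validation": "validation.csv",
    "test": "test.csv",
})
ds.push_to_hub("username/dataset_name")  # placeholder repository ID
```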