

VenusFactory Training Module User Guide

1. Introduction

The VenusFactory Training Module is a powerful tool that allows you to train custom models using protein sequence data. These models can predict various protein properties such as subcellular localization, function, stability, and more. The training module provides an intuitive interface that enables biological researchers to train high-performance protein prediction models without programming knowledge.

2. Supported Protein Language Models

VenusFactory supports various advanced protein language models. You can choose the appropriate model based on your task requirements and computational resources.

| Model Name | Parameter Size | Number of Models | Model Example |
| --- | --- | --- | --- |
| ESM2 | 8M/35M/150M/650M/3B/15B | 6 | facebook/esm2_t33_650M_UR50D |
| ESM-1b | 650M | 1 | facebook/esm1b_t33_650M_UR50S |
| ESM-1v | 650M | 5 | facebook/esm1v_t33_650M_UR90S_1 |
| ProtBert-Uniref100 | 420M | 1 | Rostlab/prot_bert |
| ProtBert-BFD100 | 420M | 1 | Rostlab/prot_bert_bfd |
| IgBert | 420M | 1 | Exscientia/IgBert |
| IgBert_unpaired | 420M | 1 | Exscientia/IgBert_unpaired |
| ProtT5-Uniref50 | 3B/11B | 2 | Rostlab/prot_t5_xl_uniref50 |
| ProtT5-BFD100 | 3B/11B | 2 | Rostlab/prot_t5_xl_bfd |
| IgT5 | 3B | 1 | Exscientia/IgT5 |
| IgT5_unpaired | 3B | 1 | Exscientia/IgT5_unpaired |
| Ankh | 450M/1.2B | 2 | ElnaggarLab/ankh-base |
| ProSST | 110M | 7 | AI4Protein/ProSST-2048 |
| ProPrime | 690M | 1 | AI4Protein/Prime_690M |
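
The identifiers in the "Model Example" column are standard Hugging Face Hub paths; VenusFactory downloads the selected model automatically. For orientation, here is a minimal sketch of how such an identifier resolves outside the interface, assuming the transformers library is installed:

```python
# Minimal sketch: resolving a "Model Example" ID from the table above.
# VenusFactory performs the equivalent download when you pick a model.
from transformers import AutoTokenizer, AutoModel

model_id = "facebook/esm2_t33_650M_UR50D"  # any ID from the table (this one is large)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Tokenize one protein sequence and obtain per-residue embeddings.
seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(seq, return_tensors="pt")
hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden_dim)
```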

3. Supported Fine-tuning Methods

VenusFactory provides multiple training methods, each with specific advantages and applicable scenarios.

| Fine-tuning Method | Description | Data Type |
| --- | --- | --- |
| Freeze | Freezes the pre-trained model, training only the classifier | Sequence information |
| Full | Full-parameter fine-tuning, training all parameters | Sequence information |
| LoRA | Uses the Low-Rank Adaptation method to reduce the number of trainable parameters | Sequence information |
| DoRA | Uses the Weight-Decomposed Low-Rank Adaptation method | Sequence information |
| AdaLoRA | Uses the Adaptive Low-Rank Adaptation method | Sequence information |
| IA3 | Uses the Infused Adapter by Inhibiting and Amplifying Inner Activations method | Sequence information |
| QLoRA | Uses the Quantized Low-Rank Adaptation method to reduce memory requirements | Sequence information |
| SES-Adapter | Uses the Structure-Enhanced Sequence Adapter, integrating sequence and structure information | Sequence & Structure information |
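
The simplest of these methods, Freeze, can be pictured as follows. This is an illustrative sketch of the general technique, not VenusFactory's exact implementation; the 8M ESM2 checkpoint and the three-class head are placeholders:

```python
# Sketch of the "Freeze" method: the pre-trained model stays fixed and only
# a small classification head is trained on top of it.
import torch
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
for param in backbone.parameters():
    param.requires_grad = False  # freeze all pre-trained weights

num_labels = 3  # placeholder: a hypothetical 3-class task
classifier = nn.Linear(backbone.config.hidden_size, num_labels)

# Only the classifier's parameters are passed to the optimizer.
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-4)
```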

4. Supported Evaluation Metrics

VenusFactory provides multiple evaluation metrics to assess model performance.

| Abbreviation | Metric Name | Applicable Problem Types | Description | Optimization Direction |
| --- | --- | --- | --- | --- |
| Accuracy | Accuracy | Single-label/Multi-label classification | Proportion of correctly predicted samples; suited to balanced datasets | Higher is better |
| Recall | Recall | Single-label/Multi-label classification | Proportion of positive samples correctly identified; focuses on reducing false negatives | Higher is better |
| Precision | Precision | Single-label/Multi-label classification | Proportion of predicted positives that are correct; focuses on reducing false positives | Higher is better |
| F1 | F1 Score | Single-label/Multi-label classification | Harmonic mean of precision and recall; suited to imbalanced datasets | Higher is better |
| MCC | Matthews Correlation Coefficient | Single-label/Multi-label classification | Uses all confusion-matrix elements; fairer on imbalanced datasets | Higher is better |
| AUROC | Area Under the ROC Curve | Single-label/Multi-label classification | Evaluates classification performance across thresholds | Higher is better |
| F1_max | Maximum F1 Score | Multi-label classification | Maximum F1 value across thresholds; suited to multi-label classification | Higher is better |
| Spearman_corr | Spearman Correlation Coefficient | Regression | Measures the monotonic relationship between predicted and true values, range [-1, 1] | Higher is better |
| MSE | Mean Squared Error | Regression | Measures the prediction error of regression models | Lower is better |
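
VenusFactory computes these metrics automatically, but for reference, here is a minimal sketch of how they are typically calculated with scikit-learn and SciPy (the label values are made up for illustration):

```python
from scipy.stats import spearmanr
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             mean_squared_error, precision_score, recall_score)

# Hypothetical classification labels and predictions.
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
print("Precision (macro):", precision_score(y_true, y_pred, average="macro"))
print("Recall (macro):", recall_score(y_true, y_pred, average="macro"))
print("MCC:", matthews_corrcoef(y_true, y_pred))

# Regression metrics operate on continuous values instead.
y_true_r, y_pred_r = [0.75, -1.2, 3.45], [0.7, -1.0, 3.2]
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
rho, _ = spearmanr(y_true_r, y_pred_r)
print("Spearman:", rho)
```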

5. Training Interface Details

The training interface is divided into several main sections, each containing specific configuration options.

5.1 Model and Dataset Configuration

Protein Language Model Selection

  • Protein Language Model: Select a pre-trained model from the dropdown menu
    • Consider your computational resources and task complexity when selecting
    • Larger models require more computational resources

Dataset Selection

  • Dataset Selection: Choose the dataset source
    • Use Pre-defined Dataset: Use system-defined datasets
      • Dataset Configuration: Select a dataset from the dropdown menu
      • The system will automatically load the problem type, number of labels, and evaluation metrics
    • Use Custom Dataset: Use a custom dataset
      • Custom Dataset Path: Enter the Hugging Face dataset path (format: username/dataset_name)

      • Problem Type: Select the problem type

        • single_label_classification: Single-label classification
        • multi_label_classification: Multi-label classification
        • regression: Regression
      • Number of Labels: Set the number of labels (for classification problems)

      • Metrics: Select evaluation metrics (multiple selections allowed)

        • accuracy: Accuracy
        • f1: F1 Score
        • precision: Precision
        • recall: Recall
        • mcc: Matthews Correlation Coefficient
        • auroc: Area Under the ROC Curve
        • f1max: Maximum F1 Score
        • spearman_corr: Spearman Correlation Coefficient
        • mse: Mean Squared Error

        For more information, refer to 4. Supported Evaluation Metrics

Dataset Preview

  • Preview Dataset: Click this button to preview the selected dataset
    • Displays dataset statistics: number of samples in training, validation, and test sets
    • Displays dataset examples: including sequences and labels

5.2 Training Method Configuration

  • Training Method: Select the training method

    • freeze: Freeze the pre-trained model and train only the classifier
    • full: Full-parameter fine-tuning; train all parameters
    • plm-lora: Use the LoRA (Low-Rank Adaptation) method to reduce the number of trainable parameters
    • dora: Use the DoRA (Weight-Decomposed Low-Rank Adaptation) method
    • adalora: Use the AdaLoRA (Adaptive Low-Rank Adaptation) method
    • ia3: Use the IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) method
    • plm-qlora: Use the QLoRA (Quantized Low-Rank Adaptation) method to reduce memory requirements
    • ses-adapter: Use the Structure-Enhanced Sequence Adapter, integrating sequence and structure information

    For more information, refer to 3. Supported Fine-tuning Methods

  • Pooling Method: Select the pooling method (a mean-pooling sketch follows this list)

    • mean: Mean pooling
    • attention1d: Attention pooling
    • light_attention: Lightweight attention pooling
  • Structure Sequence (visible when ses-adapter is selected):

    • Select structure sequence types (multiple selections allowed), default is foldseek_seq and ss8_seq
  • LoRA Parameters (visible when plm-lora or plm-qlora is selected; see the configuration sketch after this list):

    • LoRA Rank: The rank of LoRA, default is 8, affects parameter count and performance
    • LoRA Alpha: The alpha value of LoRA, default is 32, affects scaling factor
    • LoRA Dropout: The dropout rate of LoRA, default is 0.1, affects regularization
    • LoRA Target Modules: Target modules for LoRA application, default is query,key,value
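
As referenced above, pooling collapses the model's per-residue embeddings into a single fixed-size vector per protein. Here is a minimal sketch of the default mean option (the attention variants are learned modules and are not shown); this mask-aware average illustrates the general technique, not VenusFactory's exact code:

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average per-residue embeddings while ignoring padding positions.

    hidden_states: (batch, seq_len, hidden_dim) token embeddings from the PLM.
    attention_mask: (batch, seq_len), 1 for real residues and 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # sum over real residues only
    counts = mask.sum(dim=1).clamp(min=1)         # avoid division by zero
    return summed / counts                        # (batch, hidden_dim)
```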
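
The four LoRA fields correspond directly to the standard PEFT configuration. A minimal sketch with the Hugging Face peft library, using the defaults listed above (VenusFactory assembles the equivalent configuration internally from the form values; the 8M checkpoint is a placeholder):

```python
# Sketch: how the UI's LoRA fields map onto a PEFT LoraConfig.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
config = LoraConfig(
    r=8,                                       # LoRA Rank
    lora_alpha=32,                             # LoRA Alpha (scaling factor)
    lora_dropout=0.1,                          # LoRA Dropout
    target_modules=["query", "key", "value"],  # LoRA Target Modules
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the backbone
```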

5.3 Batch Processing Configuration

  • Batch Processing Mode: Select the batch processing mode
    • Batch Size Mode: Fixed batch size
      • Batch Size: Set the number of samples per batch, default is 16
    • Batch Token Mode: Fixed token count
      • Tokens per Batch: Set the number of tokens per batch, default is 10000
      • Suitable for datasets with large variations in sequence length (see the sketch below)
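
To see why token-based batching helps with variable-length proteins, here is a sketch of the general idea: sequences are sorted by length and grouped so that each batch stays under the token budget. Short proteins end up in large batches and long proteins in small ones, keeping memory use roughly flat (an illustration of the technique, not VenusFactory's exact scheduler):

```python
def batch_by_tokens(sequences, tokens_per_batch=10000):
    """Group sequences so each batch holds at most `tokens_per_batch` residues."""
    batches, current, current_tokens = [], [], 0
    for seq in sorted(sequences, key=len):  # sorting keeps padding waste low
        if current and current_tokens + len(seq) > tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(seq)
        current_tokens += len(seq)
    if current:
        batches.append(current)
    return batches
```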

5.4 Training Parameters

  • Learning Rate: Learning rate, default is 5e-4

    • Affects the step size of training updates; values that are too large may prevent convergence, while values that are too small slow training
  • Number of Epochs: Number of training epochs, default is 100

    • Number of complete passes through the dataset
    • Actual training may end earlier due to early stopping
  • Early Stopping Patience: Early stopping patience N, default is 10

    • Training will stop early if validation performance does not improve for N consecutive epochs
  • Max Sequence Length: Maximum sequence length; the default of -1 applies no limit

    • Maximum protein sequence length to process
  • Scheduler Type: Learning rate scheduler type

    • linear: Linear decay
    • cosine: Cosine decay
    • step: Step decay
    • None: No scheduler
  • Warmup Steps: Number of warmup steps, default is 0

    • Number of steps where the learning rate gradually increases from a small value to the set value
    • Helps stabilize early training
  • Gradient Accumulation Steps: Number of gradient accumulation steps, default is 1 (see the sketch after this list)

    • Accumulates gradients from multiple batches before updating the model
    • Can simulate larger batch sizes
  • Max Gradient Norm: Gradient clipping threshold, default is -1 (no clipping)

    • Limits the maximum norm of gradients to prevent gradient explosion
    • Recommended range: 1.0 to 5.0
  • Number of Workers: Number of data loading workers, default is 4

    • Affects data loading speed
    • Adjust based on CPU core count
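
As referenced above, gradient accumulation and gradient clipping interact inside the training loop. Here is a runnable toy sketch in PyTorch; the linear model and random data stand in for your PLM and dataset, and this illustrates the mechanism rather than VenusFactory's actual loop:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model, data, loss, and optimizer.
model = nn.Linear(10, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
loss_fn = nn.CrossEntropyLoss()
dataloader = [(torch.randn(16, 10), torch.randint(0, 3, (16,))) for _ in range(8)]

accumulation_steps = 4  # Gradient Accumulation Steps: simulates a 4x larger batch
max_grad_norm = 1.0     # Max Gradient Norm (-1 in the UI would mean no clipping)

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    loss = loss_fn(model(inputs), labels)
    (loss / accumulation_steps).backward()  # scale so accumulated gradients average
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()
```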

5.5 Output and Logging Settings

  • Save Directory: Save directory, default is ckpt

    • Path to save model and training results
  • Output Model Name: Output model name, default is model.pt

    • Filename of the saved model
  • Enable W&B Logging: Whether to enable Weights & Biases logging

    • When checked, you can set W&B project name and entity
    • Used for experiment tracking and visualization
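
For reference, checking the box corresponds to the standard Weights & Biases setup sketched below; the project and entity names are placeholders, and VenusFactory makes the equivalent calls for you:

```python
import wandb

# "Project" and "entity" match the two W&B fields in the UI.
wandb.init(project="venusfactory-demo", entity="my-team")  # placeholder names

# During training, scalar metrics are logged per step or epoch, e.g.:
wandb.log({"train/loss": 0.42, "val/accuracy": 0.91})
```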

5.6 Training Control and Output

  • Preview Command: Preview the training command to be executed

    • Click to display the complete command line arguments
  • Abort: Abort the current training process

  • Start: Start the training process

  • Model Statistics: Display model parameter statistics

    • Parameter counts for the training model, pre-trained model, and combined model
    • Percentage of trainable parameters
  • Training Progress: Display training progress

    • Current phase (training, validation, testing)
    • Progress percentage
    • Time elapsed and estimated time remaining
    • Current loss value and gradient steps
  • Best Performance: Display best model information

    • Best epoch and corresponding evaluation metrics
  • Training and Validation Loss: Loss curve graph

    • Training loss and validation loss over time
  • Validation Metrics: Validation set evaluation metrics graph

    • Various evaluation metrics over time
  • Test Results: Test results

    • Final performance metrics on the test set
    • Evaluation metrics can be downloaded in CSV format

6. Training Process Guide

Below is a complete guide to using the VenusFactory training module, from data preparation to model evaluation.

6.1 Preparing the Dataset

Using Pre-defined Datasets

  1. Select "Use Pre-defined Dataset" in Dataset Selection
  2. Choose a dataset from the Dataset Configuration dropdown menu
  3. Click the Preview Dataset button to view dataset statistics and examples

Using Custom Datasets

  1. Prepare a dataset that meets the requirements and upload it to Hugging Face (see Custom Dataset Format Requirements)
  2. Select "Use Custom Dataset" in Dataset Selection
  3. Enter the Hugging Face dataset path in Custom Dataset Path (format: username/dataset_name)
  4. Set Problem Type, Number of Labels, and Metrics
  5. Click the Preview Dataset button to verify that the dataset is loaded correctly

6.2 Selecting a Model and Training Method

  1. Choose a pre-trained model from the Protein Language Model dropdown menu

  2. Select an appropriate Training Method

  3. Choose a Pooling Method

  4. If selecting ses-adapter, ensure you specify structure sequence types in Structure Sequence

  5. If selecting plm-lora or plm-qlora, adjust LoRA parameters as needed

6.3 Configuring Batch Processing and Training Parameters

  1. Select Batch Processing Mode

    • Use Batch Size Mode when sequence lengths are similar
    • Use Batch Token Mode when sequence lengths vary significantly
  2. Set batch size or token count

    • Adjust based on GPU memory; reduce if memory errors occur
  3. Set Learning Rate

  4. Set Number of Epochs

    • Use early stopping mechanism; set Early Stopping Patience to 10-20 to prevent overfitting
  5. Set Max Sequence Length

  6. Adjust advanced parameters as needed

    • Scheduler Type: Recommend using linear or cosine
    • Warmup Steps: Recommend setting to 5-10% of total optimization steps (e.g., with 5,000 total steps, use roughly 250-500 warmup steps)
    • Gradient Accumulation Steps: Increase if memory is insufficient
    • Max Gradient Norm: Set to 1.0-5.0 if training is unstable

6.4 Setting Output and Logging

  1. Set Save Directory as the path to save the model
  2. Set Output Model Name as the model filename
  3. If you need to track training, check Enable W&B Logging and set project information

6.5 Starting Training

  1. Click Preview Command to preview the training command
  2. Click the Start button to begin training
  3. Observe training progress and metric changes
  4. After training is complete, view the test results
    • Check various evaluation metrics
    • Download results in CSV format if needed
  5. To stop training, click the Abort button

7. Custom Dataset Format Requirements

To use a custom dataset, you need to upload the dataset to the Hugging Face platform and ensure it meets the following format requirements.

7.1 Basic Requirements

  • The dataset must include train, validation, and test subsets
  • Each sample must contain the following fields:
    • aa_seq: Amino acid sequence using standard single-letter codes
    • label: Label, format depends on the problem type
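
Before pointing VenusFactory at a dataset, you can confirm it meets these requirements with a quick check (a sketch using the Hugging Face datasets library; username/dataset_name is a placeholder for your repository):

```python
from datasets import load_dataset

ds = load_dataset("username/dataset_name")  # placeholder repository path

# All three subsets must be present.
assert {"train", "validation", "test"} <= set(ds.keys())

# Every sample needs the two required fields.
sample = ds["train"][0]
assert "aa_seq" in sample and "label" in sample
print(sample["aa_seq"][:30], sample["label"])
```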

7.2 Label Formats for Different Problem Types

Single-label Classification (single_label_classification)

  • label: Integer value representing the class index (starting from 0)
  • Example: 0, 1, 2, ...

CSV format example:

```csv
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,1
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,0
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,2
```

Multi-label Classification (multi_label_classification)

  • label: String of comma-separated class indices representing present classes
  • Example: "373,449,584,674,780,883,897,911,1048,1073,1130,1234"

CSV format example:

```csv
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,"373,449,584,674,780,883,897,911,1048,1073,1130,1234"
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,"15,42,87,103,256"
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,"7,98,120,256,512,789"
```
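
To make the string format concrete, a label such as "15,42,87,103,256" is typically expanded into a multi-hot target vector before training (an illustrative sketch; VenusFactory performs the equivalent parsing internally, and the vector length of 1500 is a placeholder):

```python
import torch

def parse_multilabel(label: str, num_labels: int) -> torch.Tensor:
    """Turn "15,42,87" into a multi-hot vector of length `num_labels`."""
    target = torch.zeros(num_labels)
    for idx in label.split(","):
        target[int(idx)] = 1.0
    return target

vec = parse_multilabel("15,42,87,103,256", num_labels=1500)  # placeholder size
print(int(vec.sum()))  # 5 (five active classes)
```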

Regression (regression)

  • label: Floating-point number representing a continuous value
  • Examples: 0.75, -1.2, ...

CSV format example:

```csv
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,0.75
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,-1.2
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,3.45
```

7.3 Structural Information (Optional)

If using the ses-adapter training method, you can add the following structural information fields:

  • foldseek_seq: FoldSeek structure sequence, using single-letter codes to represent structural elements
  • ss8_seq: 8-class secondary structure sequence, using single-letter codes to represent secondary structures

CSV format example:

```csv
name,aa_seq,foldseek_seq,ss8_seq,label
Q9LSD8,MPEEDLVELKFRLYDGSDVGPFQYSPTATVSMLKERIVSEWPKDKKIVPKSASDIKLINAGKILENGKTVAQCKAPFDDLPKSVITMHVVVQLSPTKARPEKKIEKEEAPQRSFCSCTIM,DPPQLWAFAWEAEPVRDIDDRDTDHQQQFLLVVLQVCLVRPDPPDPDHAPHSVQKWKDDPNDTGDRNDGNNRRDDPPDDDSPDHHYIYIDGRDPPVVPPVPPPPPPPPPPPPPPPPPPPD,LLLLLLEEEEEELTTSLEEEEEEELTTLBHHHHHHHHHHTLLTTLSSLLSSGGGEEEEETTEELLTTLBHHHHLLLLLLLTTLLEEEEEEELLLLLLLLLLLLLLLLLLLLLLLLLLLLL,0
```

7.4 Uploading to Hugging Face

  1. Create separate CSV files for training, validation, and test sets:

    • train.csv: Training data
    • validation.csv: Validation data
    • test.csv: Test data
  2. Upload the dataset to Hugging Face

  • The relevant steps are illustrated in the manual's accompanying screenshots (HF1-HF4)

After uploading, use the repository path (username/dataset_name) as the Custom Dataset Path in VenusFactory.
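
Alternatively, the three CSV files can be loaded and pushed from Python in one step (a sketch using the Hugging Face datasets library; the repository path is a placeholder, and you must be logged in via huggingface-cli login or an HF_TOKEN beforehand):

```python
from datasets import load_dataset

# Load the three local CSV files as named splits.
ds = load_dataset("csv", data_files={
    "train": "train.csv",
    "validation": "validation.csv",
    "test": "test.csv",
})

ds.push_to_hub("username/dataset_name")  # placeholder repository path
```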