

VenusFactory Training Module User Guide

1. Introduction

The VenusFactory Training Module is a powerful tool that allows you to train custom models using protein sequence data. These models can predict various protein properties such as subcellular localization, function, stability, and more. The training module provides an intuitive interface that enables biological researchers to train high-performance protein prediction models without programming knowledge.

2. Supported Protein Language Models

VenusFactory supports various advanced protein language models. You can choose the appropriate model based on your task requirements and computational resources.

| Model Name | Parameter Size | Number of Models | Model Example |
| --- | --- | --- | --- |
| ESM2 | 8M/35M/150M/650M/3B/15B | 6 | facebook/esm2_t33_650M_UR50D |
| ESM-1b | 650M | 1 | facebook/esm1b_t33_650M_UR50S |
| ESM-1v | 650M | 5 | facebook/esm1v_t33_650M_UR90S_1 |
| ProtBert-Uniref100 | 420M | 1 | Rostlab/prot_bert |
| ProtBert-BFD100 | 420M | 1 | Rostlab/prot_bert_bfd |
| IgBert | 420M | 1 | Exscientia/IgBert |
| IgBert_unpaired | 420M | 1 | Exscientia/IgBert_unpaired |
| ProtT5-Uniref50 | 3B/11B | 2 | Rostlab/prot_t5_xl_uniref50 |
| ProtT5-BFD100 | 3B/11B | 2 | Rostlab/prot_t5_xl_bfd |
| IgT5 | 3B | 1 | Exscientia/IgT5 |
| IgT5_unpaired | 3B | 1 | Exscientia/IgT5_unpaired |
| Ankh | 450M/1.2B | 2 | ElnaggarLab/ankh-base |
| ProSST | 110M | 7 | AI4Protein/ProSST-2048 |
| ProPrime | 690M | 1 | AI4Protein/Prime_690M |
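
The identifiers in the "Model Example" column are standard Hugging Face Hub paths; VenusFactory downloads the selected model automatically. For orientation, here is a minimal sketch of how such an identifier resolves outside the interface, assuming the transformers library is installed:

```python
# Minimal sketch: resolving a "Model Example" ID from the table above.
# VenusFactory performs the equivalent download when you pick a model.
from transformers import AutoTokenizer, AutoModel

model_id = "facebook/esm2_t33_650M_UR50D"  # any ID from the table (this one is large)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Tokenize one protein sequence and obtain per-residue embeddings.
seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(seq, return_tensors="pt")
hidden = model(**inputs).last_hidden_state  # (1, tokens, hidden_dim)
```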

3. Supported Fine-tuning Methods

VenusFactory provides multiple training methods, each with specific advantages and applicable scenarios.

| Fine-tuning Method | Description | Data Type |
| --- | --- | --- |
| Freeze | Freezes the pre-trained model, training only the classifier | Sequence information |
| Full | Full-parameter fine-tuning, training all parameters | Sequence information |
| LoRA | Uses the Low-Rank Adaptation method to reduce the number of trainable parameters | Sequence information |
| DoRA | Uses the Weight-Decomposed Low-Rank Adaptation method | Sequence information |
| AdaLoRA | Uses the Adaptive Low-Rank Adaptation method | Sequence information |
| IA3 | Uses the Infused Adapter by Inhibiting and Amplifying Inner Activations method | Sequence information |
| QLoRA | Uses the Quantized Low-Rank Adaptation method to reduce memory requirements | Sequence information |
| SES-Adapter | Uses the Structure-Enhanced Sequence Adapter, integrating sequence and structure information | Sequence & Structure information |
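
The simplest of these methods, Freeze, can be pictured as follows. This is an illustrative sketch of the general technique, not VenusFactory's exact implementation; the 8M ESM2 checkpoint and the three-class head are placeholders:

```python
# Sketch of the "Freeze" method: the pre-trained model stays fixed and only
# a small classification head is trained on top of it.
import torch
import torch.nn as nn
from transformers import AutoModel

backbone = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
for param in backbone.parameters():
    param.requires_grad = False  # freeze all pre-trained weights

num_labels = 3  # placeholder: a hypothetical 3-class task
classifier = nn.Linear(backbone.config.hidden_size, num_labels)

# Only the classifier's parameters are passed to the optimizer.
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-4)
```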

4. Supported Evaluation Metrics

VenusFactory provides multiple evaluation metrics to assess model performance.

| Abbreviation | Metric Name | Applicable Problem Types | Description | Optimization Direction |
| --- | --- | --- | --- | --- |
| Accuracy | Accuracy | Single-label/Multi-label classification | Proportion of correctly predicted samples; suited to balanced datasets | Higher is better |
| Recall | Recall | Single-label/Multi-label classification | Proportion of positive samples correctly identified; focuses on reducing false negatives | Higher is better |
| Precision | Precision | Single-label/Multi-label classification | Proportion of predicted positives that are correct; focuses on reducing false positives | Higher is better |
| F1 | F1 Score | Single-label/Multi-label classification | Harmonic mean of precision and recall; suited to imbalanced datasets | Higher is better |
| MCC | Matthews Correlation Coefficient | Single-label/Multi-label classification | Uses all confusion-matrix elements; fairer on imbalanced datasets | Higher is better |
| AUROC | Area Under the ROC Curve | Single-label/Multi-label classification | Evaluates classification performance across thresholds | Higher is better |
| F1_max | Maximum F1 Score | Multi-label classification | Maximum F1 value across thresholds; suited to multi-label classification | Higher is better |
| Spearman_corr | Spearman Correlation Coefficient | Regression | Measures the monotonic relationship between predicted and true values, range [-1, 1] | Higher is better |
| MSE | Mean Squared Error | Regression | Measures the prediction error of regression models | Lower is better |
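
VenusFactory computes these metrics automatically, but for reference, here is a minimal sketch of how they are typically calculated with scikit-learn and SciPy (the label values are made up for illustration):

```python
from scipy.stats import spearmanr
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             mean_squared_error, precision_score, recall_score)

# Hypothetical classification labels and predictions.
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
print("Precision (macro):", precision_score(y_true, y_pred, average="macro"))
print("Recall (macro):", recall_score(y_true, y_pred, average="macro"))
print("MCC:", matthews_corrcoef(y_true, y_pred))

# Regression metrics operate on continuous values instead.
y_true_r, y_pred_r = [0.75, -1.2, 3.45], [0.7, -1.0, 3.2]
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
rho, _ = spearmanr(y_true_r, y_pred_r)
print("Spearman:", rho)
```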

5. Training Interface Details

The training interface is divided into several main sections, each containing specific configuration options.

5.1 Model and Dataset Configuration

Protein Language Model Selection

  • Protein Language Model: Select a pre-trained model from the dropdown menu
    • Consider your computational resources and task complexity when selecting
    • Larger models require more computational resources

Dataset Selection

  • Dataset Selection: Choose the dataset source
    • Use Pre-defined Dataset: Use system-defined datasets
      • Dataset Configuration: Select a dataset from the dropdown menu
      • The system will automatically load the problem type, number of labels, and evaluation metrics
    • Use Custom Dataset: Use a custom dataset
      • Custom Dataset Path: Enter the Hugging Face dataset path (format: username/dataset_name)

      • Problem Type: Select the problem type

        • single_label_classification: Single-label classification
        • multi_label_classification: Multi-label classification
        • regression: Regression
      • Number of Labels: Set the number of labels (for classification problems)

      • Metrics: Select evaluation metrics (multiple selections allowed)

        • accuracy: Accuracy
        • f1: F1 Score
        • precision: Precision
        • recall: Recall
        • mcc: Matthews Correlation Coefficient
        • auroc: Area Under the ROC Curve
        • f1max: Maximum F1 Score
        • spearman_corr: Spearman Correlation Coefficient
        • mse: Mean Squared Error

        For more information, refer to 4. Supported Evaluation Metrics

Dataset Preview

  • Preview Dataset: Click this button to preview the selected dataset
    • Displays dataset statistics: number of samples in training, validation, and test sets
    • Displays dataset examples: including sequences and labels

5.2 Training Method Configuration

  • Training Method: Select the training method

    • freeze: Freeze the pre-trained model and train only the classifier
    • full: Full-parameter fine-tuning; train all parameters
    • plm-lora: Use the LoRA (Low-Rank Adaptation) method to reduce the number of trainable parameters
    • dora: Use the DoRA (Weight-Decomposed Low-Rank Adaptation) method
    • adalora: Use the AdaLoRA (Adaptive Low-Rank Adaptation) method
    • ia3: Use the IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) method
    • plm-qlora: Use the QLoRA (Quantized Low-Rank Adaptation) method to reduce memory requirements
    • ses-adapter: Use the Structure-Enhanced Sequence Adapter, integrating sequence and structure information

    For more information, refer to 3. Supported Fine-tuning Methods

  • Pooling Method: Select the pooling method (a mean-pooling sketch follows this list)

    • mean: Mean pooling
    • attention1d: Attention pooling
    • light_attention: Lightweight attention pooling
  • Structure Sequence (visible when ses-adapter is selected):

    • Select structure sequence types (multiple selections allowed), default is foldseek_seq and ss8_seq
  • LoRA Parameters (visible when plm-lora or plm-qlora is selected; see the configuration sketch after this list):

    • LoRA Rank: The rank of LoRA, default is 8, affects parameter count and performance
    • LoRA Alpha: The alpha value of LoRA, default is 32, affects scaling factor
    • LoRA Dropout: The dropout rate of LoRA, default is 0.1, affects regularization
    • LoRA Target Modules: Target modules for LoRA application, default is query,key,value
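
As referenced above, pooling collapses the model's per-residue embeddings into a single fixed-size vector per protein. Here is a minimal sketch of the default mean option (the attention variants are learned modules and are not shown); this mask-aware average illustrates the general technique, not VenusFactory's exact code:

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average per-residue embeddings while ignoring padding positions.

    hidden_states: (batch, seq_len, hidden_dim) token embeddings from the PLM.
    attention_mask: (batch, seq_len), 1 for real residues and 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # sum over real residues only
    counts = mask.sum(dim=1).clamp(min=1)         # avoid division by zero
    return summed / counts                        # (batch, hidden_dim)
```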
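
The four LoRA fields correspond directly to the standard PEFT configuration. A minimal sketch with the Hugging Face peft library, using the defaults listed above (VenusFactory assembles the equivalent configuration internally from the form values; the 8M checkpoint is a placeholder):

```python
# Sketch: how the UI's LoRA fields map onto a PEFT LoraConfig.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
config = LoraConfig(
    r=8,                                       # LoRA Rank
    lora_alpha=32,                             # LoRA Alpha (scaling factor)
    lora_dropout=0.1,                          # LoRA Dropout
    target_modules=["query", "key", "value"],  # LoRA Target Modules
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the backbone
```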

5.3 Batch Processing Configuration

  • Batch Processing Mode: Select the batch processing mode
    • Batch Size Mode: Fixed batch size
      • Batch Size: Set the number of samples per batch, default is 16
    • Batch Token Mode: Fixed token count
      • Tokens per Batch: Set the number of tokens per batch, default is 10000
      • Suitable for datasets with large variations in sequence length (see the sketch below)
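
To see why token-based batching helps with variable-length proteins, here is a sketch of the general idea: sequences are sorted by length and grouped so that each batch stays under the token budget. Short proteins end up in large batches and long proteins in small ones, keeping memory use roughly flat (an illustration of the technique, not VenusFactory's exact scheduler):

```python
def batch_by_tokens(sequences, tokens_per_batch=10000):
    """Group sequences so each batch holds at most `tokens_per_batch` residues."""
    batches, current, current_tokens = [], [], 0
    for seq in sorted(sequences, key=len):  # sorting keeps padding waste low
        if current and current_tokens + len(seq) > tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(seq)
        current_tokens += len(seq)
    if current:
        batches.append(current)
    return batches
```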

5.4 Training Parameters

  • Learning Rate: Learning rate, default is 5e-4

    • Affects the step size of training updates; values that are too large may prevent convergence, while values that are too small slow training
  • Number of Epochs: Number of training epochs, default is 100

    • Number of complete passes through the dataset
    • Actual training may end earlier due to early stopping
  • Early Stopping Patience: Early stopping patience N, default is 10

    • Training will stop early if validation performance does not improve for N consecutive epochs
  • Max Sequence Length: Maximum sequence length; the default of -1 applies no limit

    • Maximum protein sequence length to process
  • Scheduler Type: Learning rate scheduler type

    • linear: Linear decay
    • cosine: Cosine decay
    • step: Step decay
    • None: No scheduler
  • Warmup Steps: Number of warmup steps, default is 0

    • Number of steps where the learning rate gradually increases from a small value to the set value
    • Helps stabilize early training
  • Gradient Accumulation Steps: Number of gradient accumulation steps, default is 1 (see the sketch after this list)

    • Accumulates gradients from multiple batches before updating the model
    • Can simulate larger batch sizes
  • Max Gradient Norm: Gradient clipping threshold, default is -1 (no clipping)

    • Limits the maximum norm of gradients to prevent gradient explosion
    • Recommended range: 1.0 to 5.0
  • Number of Workers: Number of data loading workers, default is 4

    • Affects data loading speed
    • Adjust based on CPU core count
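
As referenced above, gradient accumulation and gradient clipping interact inside the training loop. Here is a runnable toy sketch in PyTorch; the linear model and random data stand in for your PLM and dataset, and this illustrates the mechanism rather than VenusFactory's actual loop:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model, data, loss, and optimizer.
model = nn.Linear(10, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
loss_fn = nn.CrossEntropyLoss()
dataloader = [(torch.randn(16, 10), torch.randint(0, 3, (16,))) for _ in range(8)]

accumulation_steps = 4  # Gradient Accumulation Steps: simulates a 4x larger batch
max_grad_norm = 1.0     # Max Gradient Norm (-1 in the UI would mean no clipping)

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    loss = loss_fn(model(inputs), labels)
    (loss / accumulation_steps).backward()  # scale so accumulated gradients average
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()
```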

5.5 Output and Logging Settings

  • Save Directory: Save directory, default is ckpt

    • Path to save model and training results
  • Output Model Name: Output model name, default is model.pt

    • Filename of the saved model
  • Enable W&B Logging: Whether to enable Weights & Biases logging

    • When checked, you can set W&B project name and entity
    • Used for experiment tracking and visualization
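
For reference, checking the box corresponds to the standard Weights & Biases setup sketched below; the project and entity names are placeholders, and VenusFactory makes the equivalent calls for you:

```python
import wandb

# "Project" and "entity" match the two W&B fields in the UI.
wandb.init(project="venusfactory-demo", entity="my-team")  # placeholder names

# During training, scalar metrics are logged per step or epoch, e.g.:
wandb.log({"train/loss": 0.42, "val/accuracy": 0.91})
```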

5.6 Training Control and Output

  • Preview Command: Preview the training command to be executed

    • Click to display the complete command line arguments
  • Abort: Abort the current training process

  • Start: Start the training process

  • Model Statistics: Display model parameter statistics

    • Parameter counts for the training model, pre-trained model, and combined model
    • Percentage of trainable parameters
  • Training Progress: Display training progress

    • Current phase (training, validation, testing)
    • Progress percentage
    • Time elapsed and estimated time remaining
    • Current loss value and gradient steps
  • Best Performance: Display best model information

    • Best epoch and corresponding evaluation metrics
  • Training and Validation Loss: Loss curve graph

    • Training loss and validation loss over time
  • Validation Metrics: Validation set evaluation metrics graph

    • Various evaluation metrics over time
  • Test Results: Test results

    • Final performance metrics on the test set
    • Evaluation metrics can be downloaded in CSV format

6. Training Process Guide

Below is a complete guide to using the VenusFactory training module, from data preparation to model evaluation.

6.1 Preparing the Dataset

Using Pre-defined Datasets

  1. Select "Use Pre-defined Dataset" in Dataset Selection
  2. Choose a dataset from the Dataset Configuration dropdown menu
  3. Click the Preview Dataset button to view dataset statistics and examples

Using Custom Datasets

  1. Prepare a dataset that meets the requirements and upload it to Hugging Face (see Custom Dataset Format Requirements)
  2. Select "Use Custom Dataset" in Dataset Selection
  3. Enter the Hugging Face dataset path in Custom Dataset Path (format: username/dataset_name)
  4. Set Problem Type, Number of Labels, and Metrics
  5. Click the Preview Dataset button to verify that the dataset is loaded correctly

6.2 Selecting a Model and Training Method

  1. Choose a pre-trained model from the Protein Language Model dropdown menu

  2. Select an appropriate Training Method

  3. Choose a Pooling Method

  4. If selecting ses-adapter, ensure you specify structure sequence types in Structure Sequence

  5. If selecting plm-lora or plm-qlora, adjust LoRA parameters as needed

6.3 Configuring Batch Processing and Training Parameters

  1. Select Batch Processing Mode

    • Use Batch Size Mode when sequence lengths are similar
    • Use Batch Token Mode when sequence lengths vary significantly
  2. Set batch size or token count

    • Adjust based on GPU memory; reduce if memory errors occur
  3. Set Learning Rate

  4. Set Number of Epochs

    • Use early stopping mechanism; set Early Stopping Patience to 10-20 to prevent overfitting
  5. Set Max Sequence Length

  6. Adjust advanced parameters as needed

    • Scheduler Type: Recommend using linear or cosine
    • Warmup Steps: Recommend setting to 5-10% of total optimization steps (e.g., with 5,000 total steps, use roughly 250-500 warmup steps)
    • Gradient Accumulation Steps: Increase if memory is insufficient
    • Max Gradient Norm: Set to 1.0-5.0 if training is unstable

6.4 Setting Output and Logging

  1. Set Save Directory as the path to save the model
  2. Set Output Model Name as the model filename
  3. If you need to track training, check Enable W&B Logging and set project information

6.5 Starting Training

  1. Click Preview Command to preview the training command
  2. Click the Start button to begin training
  3. Observe training progress and metric changes
  4. After training is complete, view the test results
    • Check various evaluation metrics
    • Download results in CSV format if needed
  5. To stop training, click the Abort button

7. Custom Dataset Format Requirements

To use a custom dataset, you need to upload the dataset to the Hugging Face platform and ensure it meets the following format requirements.

7.1 Basic Requirements

  • The dataset must include train, validation, and test subsets
  • Each sample must contain the following fields:
    • aa_seq: Amino acid sequence using standard single-letter codes
    • label: Label, format depends on the problem type
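
Before pointing VenusFactory at a dataset, you can confirm it meets these requirements with a quick check (a sketch using the Hugging Face datasets library; username/dataset_name is a placeholder for your repository):

```python
from datasets import load_dataset

ds = load_dataset("username/dataset_name")  # placeholder repository path

# All three subsets must be present.
assert {"train", "validation", "test"} <= set(ds.keys())

# Every sample needs the two required fields.
sample = ds["train"][0]
assert "aa_seq" in sample and "label" in sample
print(sample["aa_seq"][:30], sample["label"])
```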

7.2 Label Formats for Different Problem Types

Single-label Classification (single_label_classification)

  • label: Integer value representing the class index (starting from 0)
  • Example: 0, 1, 2, ...

CSV format example:

```csv
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,1
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,0
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,2
```

Multi-label Classification (multi_label_classification)

  • label: String of comma-separated class indices representing present classes
  • Example: "373,449,584,674,780,883,897,911,1048,1073,1130,1234"

CSV format example:

```csv
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,"373,449,584,674,780,883,897,911,1048,1073,1130,1234"
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,"15,42,87,103,256"
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,"7,98,120,256,512,789"
```
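
To make the string format concrete, a label such as "15,42,87,103,256" is typically expanded into a multi-hot target vector before training (an illustrative sketch; VenusFactory performs the equivalent parsing internally, and the vector length of 1500 is a placeholder):

```python
import torch

def parse_multilabel(label: str, num_labels: int) -> torch.Tensor:
    """Turn "15,42,87" into a multi-hot vector of length `num_labels`."""
    target = torch.zeros(num_labels)
    for idx in label.split(","):
        target[int(idx)] = 1.0
    return target

vec = parse_multilabel("15,42,87,103,256", num_labels=1500)  # placeholder size
print(int(vec.sum()))  # 5 (five active classes)
```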

Regression (regression)

  • label: Floating-point number representing a continuous value
  • Examples: 0.75, -1.2, ...

CSV format example:

```csv
aa_seq,label
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG,0.75
MLKFQQFGKGVLTEQKHALSELVCGLLEGRPFSQHEKETITIGIINIANNNDLFSAYK,-1.2
MSDKIIHLTDDSFDTDVLKADGAILVDFWAEWCGPCKMIAPILDEIADEYQGKLTVAK,3.45
```

7.3 Structural Information (Optional)

If using the ses-adapter training method, you can add the following structural information fields:

  • foldseek_seq: FoldSeek structure sequence, using single-letter codes to represent structural elements
  • ss8_seq: 8-class secondary structure sequence, using single-letter codes to represent secondary structures

CSV format example:

```csv
name,aa_seq,foldseek_seq,ss8_seq,label
Q9LSD8,MPEEDLVELKFRLYDGSDVGPFQYSPTATVSMLKERIVSEWPKDKKIVPKSASDIKLINAGKILENGKTVAQCKAPFDDLPKSVITMHVVVQLSPTKARPEKKIEKEEAPQRSFCSCTIM,DPPQLWAFAWEAEPVRDIDDRDTDHQQQFLLVVLQVCLVRPDPPDPDHAPHSVQKWKDDPNDTGDRNDGNNRRDDPPDDDSPDHHYIYIDGRDPPVVPPVPPPPPPPPPPPPPPPPPPPD,LLLLLLEEEEEELTTSLEEEEEEELTTLBHHHHHHHHHHTLLTTLSSLLSSGGGEEEEETTEELLTTLBHHHHLLLLLLLTTLLEEEEEEELLLLLLLLLLLLLLLLLLLLLLLLLLLLL,0
```

7.4 Uploading to Hugging Face

  1. Create separate CSV files for training, validation, and test sets:

    • train.csv: Training data
    • validation.csv: Validation data
    • test.csv: Test data
  2. Upload the dataset to Hugging Face

  • The relevant steps are illustrated in the manual's accompanying screenshots (HF1-HF4)

After uploading, use the repository path (username/dataset_name) as the Custom Dataset Path in VenusFactory.
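
Alternatively, the three CSV files can be loaded and pushed from Python in one step (a sketch using the Hugging Face datasets library; the repository path is a placeholder, and you must be logged in via huggingface-cli login or an HF_TOKEN beforehand):

```python
from datasets import load_dataset

# Load the three local CSV files as named splits.
ds = load_dataset("csv", data_files={
    "train": "train.csv",
    "validation": "validation.csv",
    "test": "test.csv",
})

ds.push_to_hub("username/dataset_name")  # placeholder repository path
```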