ProFactory Frequently Asked Questions (FAQ)

Installation and Environment Configuration Issues

Q1: How to properly install ProFactory?

Answer: You can find the installation steps in the README.md file at the root of the repository.

Q2: What should I do if I encounter the error "Could not find a specific dependency" during installation?

Answer: There are several solutions for this situation:

  1. Try installing the problematic dependency individually:

    pip install name_of_the_problematic_library
    
  2. If it is a CUDA-related library, ensure you have installed a PyTorch version compatible with your CUDA version:

    # For example, for CUDA 11.7
    pip install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
    
  3. For some special libraries, you may need to install system dependencies first. For example, on Ubuntu:

    sudo apt-get update
    sudo apt-get install build-essential
    

Q3: How can I check if my CUDA is installed correctly?

Answer: You can verify that CUDA is installed correctly using the following methods:

  1. Check the CUDA version:

    nvidia-smi
    
  2. Verify that PyTorch can detect CUDA from Python:

    import torch
    print(torch.cuda.is_available())  # Should return True
    print(torch.cuda.device_count())  # Displays the number of GPUs
    print(torch.cuda.get_device_name(0))  # Displays the GPU name
    
  3. If PyTorch cannot recognize CUDA, ensure you have installed the matching versions of PyTorch and CUDA.

Hardware and Resource Issues

Q4: What should I do if I encounter a "CUDA out of memory" error during runtime?

Answer: This error indicates that your GPU memory is insufficient. Solutions include:

  1. Reduce the batch size: This is the most direct and effective method. Reduce the batch size in the training configuration by half or more.

  2. Use a smaller model: Choose a pre-trained model with fewer parameters, for example a small ESM-2 variant (35M parameters) instead of a large one (650M parameters).

  3. Enable gradient accumulation: Increase the gradient_accumulation_steps parameter (for example, to 2 or 4) while lowering the per-step batch size; this reduces memory usage while keeping the effective batch size unchanged (see the sketch after this list).

  4. Use mixed precision training: Enable the fp16 option in the training options, which can significantly reduce memory usage.

  5. Reduce the maximum sequence length: If your data allows, you can decrease the max_seq_length parameter.
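
For example, assuming ProFactory's training backend follows Hugging Face transformers' Trainer conventions (an assumption; the option names in your version may differ), a memory-saving configuration might look like this sketch:

    from transformers import TrainingArguments

    # Hypothetical memory-saving settings; names follow transformers'
    # TrainingArguments and may be exposed differently in ProFactory.
    args = TrainingArguments(
        output_dir="ckpt",
        per_device_train_batch_size=4,   # halved batch size
        gradient_accumulation_steps=4,   # effective batch size = 4 * 4 = 16
        fp16=True,                       # mixed precision significantly cuts memory use
    )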

Q5: How can I determine what batch size I should use?

Answer: Determining the appropriate batch size requires balancing memory usage and training effectiveness:

  1. Start small and increase gradually: Begin with a small value (like 4 or 8) and raise it until GPU memory is close to its limit (see the probing sketch after this list).

  2. Refer to benchmarks: For common protein models, most studies use a batch size of 16-64, but this depends on your GPU memory and sequence length.

  3. Monitor the training process: A larger batch size may make each training iteration more stable but may require a higher learning rate.

  4. Rule of thumb for memory issues: If you encounter memory errors, first try halving the batch size.
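
One practical way to find the limit is to probe increasing batch sizes until the GPU runs out of memory. A minimal PyTorch sketch (the model and input shape are illustrative placeholders, not ProFactory APIs):

    import torch

    def find_max_batch_size(model, seq_len=512, vocab_size=33, start=4, device="cuda"):
        """Double the batch size until CUDA raises an out-of-memory error."""
        model = model.to(device)
        batch_size = start
        while True:
            try:
                dummy = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
                model(dummy)              # forward pass only; training needs extra memory
                batch_size *= 2
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()
                return batch_size // 2    # the last size that fit

Because backpropagation also stores activations and optimizer state, the batch size that fits for training is usually smaller than this forward-only estimate.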

Dataset Issues

Q6: How do I prepare a custom dataset?

Answer: Preparing a custom dataset requires the following steps:

  1. Format the data: The data should be organized into a CSV file, containing at least the following columns:

    • sequence: The protein sequence, represented using standard amino acid letters
    • Label column: Depending on your task type, this can be numerical (regression) or categorical (classification)
  2. Split the data: Prepare training, validation, and test sets, such as train.csv, validation.csv, and test.csv.

  3. Upload to Hugging Face:

    • Create a dataset repository on Hugging Face
    • Upload your CSV file
    • Reference it in ProFactory using the username/dataset_name format
  4. Create dataset configuration: The configuration should include the problem type (regression or classification), number of labels, and evaluation metrics.
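
For example, a small script to build and split such a CSV with pandas (the toy sequences and the sequence/label column names follow the convention described above):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({
        "sequence": ["MKTAYIAKQR", "MVLSPADKTN", "GAVLIPFMWS", "TCYNQDEKRH"],
        "label": [0, 1, 0, 1],   # classification labels; use floats for regression
    })

    train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
    train_df.to_csv("train.csv", index=False)
    test_df.to_csv("test.csv", index=False)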

Q7: What should I do if I encounter a format error when importing my dataset?

Answer: Common format issues and their solutions:

  1. Incorrect column names: Ensure the CSV file contains the necessary columns, especially the sequence column and label column.

  2. Sequence format issues:

    • Ensure the sequence contains only valid amino acid letters (ACDEFGHIKLMNPQRSTVWY)
    • Remove spaces, line breaks, or other illegal characters from the sequence
    • Check if the sequence length is within a reasonable range
  3. Encoding issues: Ensure the CSV file is saved with UTF-8 encoding.

  4. CSV delimiter issues: Ensure the file uses the correct delimiter (usually a comma). You can use a text editor to view and correct it.

  5. Handling missing values: Ensure there are no missing values in the data, or handle them appropriately.
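
A quick sanity check covering most of these issues might look like the following sketch (it assumes the sequence and label column names described above):

    import pandas as pd

    VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

    df = pd.read_csv("train.csv", encoding="utf-8")
    assert {"sequence", "label"} <= set(df.columns), "missing required columns"
    assert not df.isnull().any().any(), "dataset contains missing values"

    # Flag sequences with spaces, line breaks, or other invalid characters
    bad = df[~df["sequence"].apply(lambda s: set(str(s).upper()) <= VALID_AA)]
    print(f"{len(bad)} sequences contain invalid characters")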

Q8: My dataset is large, and the system loads slowly or crashes. What should I do?

Answer: For large datasets, you can:

  1. Reduce the dataset size: If possible, test your method with a subset of the data first.

  2. Increase data loading efficiency:

    • Use the batch_size parameter to control the amount of data loaded at a time
    • Enable data caching to avoid repeated loading
    • Preprocess the data to reduce file size (e.g., remove unnecessary columns)
  3. Dataset sharding: Split large datasets into multiple smaller files and process them one by one.

  4. Increase system resources: If possible, increase RAM or use a server with more memory.
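
For instance, pandas can stream a large CSV in fixed-size chunks instead of loading it all at once (process_chunk is a placeholder for your own logic):

    import pandas as pd

    # Read 10,000 rows at a time to bound memory usage
    for chunk in pd.read_csv("large_dataset.csv", chunksize=10_000):
        chunk = chunk[["sequence", "label"]]   # keep only the columns you need
        process_chunk(chunk)                   # placeholder: your preprocessing or inference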

Training Issues

Q9: How can I recover if the training suddenly interrupts?

Answer: Methods to handle training interruptions:

  1. Check checkpoints: The system periodically saves checkpoints (usually in the ckpt directory). You can recover from the most recent checkpoint:

    • Look for the last saved model file (usually named checkpoint-X, where X is the step number)
    • Specify the checkpoint path as the starting point in the training options
  2. Use the checkpoint recovery feature: Enable the checkpoint recovery option in the training configuration.

  3. Save checkpoints more frequently: Adjust the frequency of saving checkpoints, for example, save every 500 steps instead of the default every 1000 steps.
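
If the training backend follows transformers' Trainer conventions (an assumption; ProFactory may expose these as interface options instead), the checkpoint frequency and resume step might look like:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="ckpt",
        save_steps=500,          # checkpoint every 500 steps instead of 1000
        save_total_limit=3,      # keep only the 3 most recent checkpoints
    )

    # With a Trainer built from these arguments, training can resume from
    # the most recent checkpoint directory, for example:
    # trainer.train(resume_from_checkpoint="ckpt/checkpoint-500")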

Q10: How can I speed up training if it is very slow?

Answer: Methods to speed up training:

  1. Hardware aspects:

    • Use a more powerful GPU
    • Use multi-GPU training (if supported)
    • Ensure data is stored on an SSD rather than an HDD
  2. Parameter settings:

    • Use mixed precision training (enable the fp16 option)
    • Increase the batch size (if memory allows)
    • Reduce the maximum sequence length (if the task allows)
    • Decrease validation frequency (the eval_steps parameter)
  3. Model selection:

    • Choose a smaller pre-trained model
    • Use parameter-efficient fine-tuning methods (like LoRA)
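
As an illustration of the last point, a LoRA setup using the peft library (a sketch; ProFactory may expose LoRA through its own training options rather than direct peft calls):

    from transformers import AutoModelForSequenceClassification
    from peft import LoraConfig, get_peft_model

    model = AutoModelForSequenceClassification.from_pretrained(
        "facebook/esm2_t12_35M_UR50D", num_labels=2
    )

    # LoRA trains small low-rank adapter matrices instead of all weights
    lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()   # typically well under 1% of the full model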

Q11: What does it mean if the loss value does not decrease or if NaN values appear during training?

Answer: This usually indicates that there is a problem with the training:

  1. Reasons for loss not decreasing and solutions:

    • Learning rate too high: Try reducing the learning rate, for example, from 5e-5 to 1e-5
    • Optimizer issues: Try different optimizers, such as switching from Adam to AdamW
    • Initialization issues: Check the model initialization settings
    • Data issues: Validate if the training data has outliers or label errors
  2. Reasons for NaN values and solutions:

    • Gradient explosion: Add gradient clipping, set the max_grad_norm parameter
    • Learning rate too high: Significantly reduce the learning rate
    • Numerical instability: This may occur when using mixed precision training; try disabling the fp16 option
    • Data anomalies: Check if there are extreme values in the input data
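
Gradient clipping in a plain PyTorch training step looks like this (when training through ProFactory, the equivalent knob is the max_grad_norm parameter mentioned above):

    import torch

    def training_step(model, inputs, labels, optimizer, loss_fn, max_grad_norm=1.0):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        # Rescale gradients so their global norm never exceeds max_grad_norm,
        # preventing a single bad batch from causing exploding gradients or NaNs
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
        optimizer.step()
        return loss.item()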

Q12: What is overfitting, and how can it be avoided?

Answer: Overfitting refers to a model performing well on training data but poorly on new data. Methods to avoid overfitting include:

  1. Increase the amount of data: Use more training data or data augmentation techniques.

  2. Regularization methods:

    • Add dropout (usually set to 0.1-0.3)
    • Use weight decay
    • Early stopping: Stop training when the validation performance no longer improves
  3. Simplify the model:

    • Use fewer layers or smaller hidden dimensions
    • Freeze some layers of the pre-trained model (using the freeze method)
  4. Cross-validation: Use k-fold cross-validation to obtain a more robust model.
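
Early stopping and weight decay, for example, can be expressed with transformers' Trainer utilities (assuming a transformers-based loop; ProFactory may expose these as simple options):

    from transformers import EarlyStoppingCallback, TrainingArguments

    args = TrainingArguments(
        output_dir="ckpt",
        evaluation_strategy="steps",       # evaluate on the validation set periodically
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        weight_decay=0.01,                 # L2-style regularization
    )

    # Stop if validation loss fails to improve for 3 consecutive evaluations
    early_stop = EarlyStoppingCallback(early_stopping_patience=3)
    # trainer = Trainer(model=..., args=args, callbacks=[early_stop], ...)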

Evaluation Issues

Q13: How do I interpret evaluation metrics? Which metric is the most important?

Answer: Different tasks focus on different metrics:

  1. Classification tasks:

    • Accuracy: The proportion of correct predictions, suitable for balanced datasets
    • F1 Score: The harmonic mean of precision and recall, suitable for imbalanced datasets
    • MCC (Matthews Correlation Coefficient): A comprehensive measure of classification performance, more robust to class imbalance
    • AUROC (Area Under the ROC Curve): Measures the model's ability to distinguish between different classes
  2. Regression tasks:

    • MSE (Mean Squared Error): The average of the squared differences between predicted and actual values; smaller is better
    • RMSE (Root Mean Squared Error): The square root of MSE, in the same units as the original data
    • MAE (Mean Absolute Error): The average of the absolute differences between predicted and actual values
    • R² (Coefficient of Determination): Measures the proportion of variance explained by the model, the closer to 1 the better
  3. Most important metric: Depends on your specific application needs. For example, in drug screening, you may focus more on true positive rates; for structural prediction, you may focus more on RMSE.
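
All of these metrics can be computed directly with scikit-learn, for example:

    from sklearn.metrics import (
        accuracy_score, f1_score, matthews_corrcoef, roc_auc_score,
        mean_squared_error, mean_absolute_error, r2_score,
    )

    # Classification: y_pred are predicted labels, y_score positive-class probabilities
    y_true, y_pred, y_score = [0, 1, 1, 0], [0, 1, 0, 0], [0.2, 0.9, 0.4, 0.1]
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))
    print("MCC:", matthews_corrcoef(y_true, y_pred))
    print("AUROC:", roc_auc_score(y_true, y_score))

    # Regression
    y_true_r, y_pred_r = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
    mse = mean_squared_error(y_true_r, y_pred_r)
    print("MSE:", mse, "RMSE:", mse ** 0.5)
    print("MAE:", mean_absolute_error(y_true_r, y_pred_r))
    print("R2:", r2_score(y_true_r, y_pred_r))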

Q14: What should I do if the evaluation results are poor?

Answer: Common strategies to improve model performance:

  1. Data quality:

    • Check for errors or noise in the data
    • Increase the number of training samples
    • Ensure the training and test set distributions are similar
  2. Model adjustments:

    • Try different pre-trained models
    • Adjust hyperparameters like learning rate and batch size
    • Use different fine-tuning methods (full parameter fine-tuning, LoRA, etc.)
  3. Feature engineering:

    • Add structural information (e.g., using foldseek features)
    • Consider sequence characteristics (e.g., hydrophobicity, charge, etc.)
  4. Ensemble methods:

    • Train multiple models and combine results
    • Use cross-validation to obtain a more robust model

Q15: Why does my model perform much worse on the test set than on the validation set?

Answer: Common reasons for decreased performance on the test set:

  1. Data distribution shift:

    • The training, validation, and test set distributions are inconsistent
    • The test set contains protein families or features not seen during training
  2. Overfitting:

    • The model overfits the validation set because it was used for model selection
    • Increasing regularization or reducing the number of training epochs may help
  3. Data leakage:

    • Unintentionally leaking test data information into the training process
    • Ensure data splitting is done before preprocessing to avoid cross-contamination
  4. Randomness:

    • If the test set is small, results may be influenced by randomness
    • Try training multiple models with different random seeds and averaging the results

Prediction Issues

Q16: How can I speed up the prediction process?

Answer: Methods to speed up predictions:

  1. Batch prediction: Use batch prediction mode instead of single-sequence prediction, which can utilize the GPU more efficiently.

  2. Reduce computation:

    • Use a smaller model or a more efficient fine-tuning method
    • Reduce the maximum sequence length (if possible)
  3. Hardware optimization:

    • Use a faster GPU or CPU
    • Ensure predictions are done on the GPU rather than the CPU
  4. Model optimization:

    • Try model quantization (e.g., int8 quantization)
    • Exporting to ONNX format may provide faster inference speeds
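
A minimal batched-inference pattern in PyTorch (a sketch with placeholder inputs; ProFactory's batch prediction mode handles this for you):

    import torch

    @torch.no_grad()                      # disable gradient tracking during inference
    def predict_in_batches(model, inputs, batch_size=32, device="cuda"):
        model = model.to(device).eval()   # eval mode disables dropout, etc.
        outputs = []
        for i in range(0, len(inputs), batch_size):
            batch = inputs[i : i + batch_size].to(device)
            outputs.append(model(batch).cpu())
        return torch.cat(outputs)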

Q17: What could be the reason for the prediction results being significantly different from expectations?

Answer: Possible reasons for prediction discrepancies:

  1. Data mismatch:

    • The sequences being predicted differ from the training data distribution
    • There are significant differences in sequence length, composition, or structural features
  2. Model issues:

    • The model is under-trained or overfitted
    • An unsuitable pre-trained model was chosen for the task
  3. Parameter configuration:

    • Ensure the parameters used during prediction (like maximum sequence length) are consistent with those used during training
    • Check if the correct problem type (classification/regression) is being used
  4. Data preprocessing:

    • Ensure the prediction data undergoes the same preprocessing steps as the training data
    • Check if the sequence format is correct (standard amino acid letters, no special characters)

Q18: How can I batch predict a large number of sequences?

Answer: Steps for efficient batch prediction:

  1. Prepare the input file:

    • Create a CSV file containing all sequences
    • The file must include a sequence column
    • Optionally include an ID or other identifier columns
  2. Use the batch prediction feature:

    • Go to the prediction tab
    • Select "Batch Prediction" mode
    • Upload the sequence file
    • Set an appropriate batch size (usually 16-32 is a good balance)
  3. Optimize settings:

    • Increasing the batch size can improve throughput (if memory allows)
    • Reducing unnecessary feature calculations can speed up processing
  4. Result handling:

    • After prediction is complete, the system will generate a CSV file containing the original sequences and prediction results
    • You can download this file for further analysis

Model and Result Issues

Q19: Which pre-trained model should I choose?

Answer: Model selection recommendations:

  1. For general tasks:

    • ESM-2 is suitable for various protein-related tasks, balancing performance and efficiency
    • ProtBERT performs well on certain sequence classification tasks
  2. Considerations:

    • Data volume: When data is limited, a smaller model may be better (to avoid overfitting)
    • Sequence length: For long sequences, consider models that support longer contexts
    • Computational resources: When resources are limited, choose smaller models or parameter-efficient methods
    • Task type: Different models have their advantages in different tasks
  3. Recommended strategy: If conditions allow, try several different models and choose the one that performs best on the validation set.

Q20: How do I interpret the loss curve during training?

Answer: Guidelines for interpreting the loss curve:

  1. Ideal curve:

    • Both training loss and validation loss decrease steadily
    • The two curves eventually stabilize and converge
    • The validation loss stabilizes near its lowest point
  2. Common patterns and their meanings:

    • Training loss continues to decrease while validation loss increases: Signal of overfitting; consider increasing regularization
    • Both losses stagnate at high values: Indicates underfitting; may need a more complex model or longer training
    • Curve fluctuates dramatically: The learning rate may be too high; consider lowering it
    • Validation loss is lower than training loss: This may indicate a data splitting issue, or it can occur because dropout and other regularization are active only during training
  3. Adjusting based on the curve:

    • If validation loss stops improving early, consider early stopping
    • If training loss decreases very slowly, try increasing the learning rate
    • If there are sudden jumps in the curve, check for data issues or learning rate scheduling

Q21: How do I save and share my model?

Answer: Guidelines for saving and sharing models:

  1. Local saving:

    • After training is complete, the model will be automatically saved in the specified output directory
    • The complete model includes model weights, configuration files, and tokenizer information
  2. Important files:

    • pytorch_model.bin: Model weights
    • config.json: Model configuration
    • special_tokens_map.json and tokenizer_config.json: Tokenizer configuration
  3. Sharing the model:

    • Hugging Face Hub: The easiest way is to upload to Hugging Face

      • Create a model repository
      • Upload your model files
      • Add model descriptions and usage instructions in the readme
    • Local export: You can also compress the model folder and share it

      • Ensure all necessary files are included
      • Provide environment requirements and usage instructions
  4. Documentation: Regardless of the sharing method, you should provide:

    • Description of the training data
    • Model architecture and parameters
    • Performance metrics
    • Usage examples
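
For the Hugging Face route, the huggingface_hub library can create the repository and upload the model folder in a few lines (the repository name and local path here are placeholders):

    from huggingface_hub import HfApi

    api = HfApi()
    # Requires prior authentication, e.g. `huggingface-cli login`
    api.create_repo("your-username/my-protein-model", exist_ok=True)
    api.upload_folder(
        folder_path="ckpt/best_model",              # hypothetical local model directory
        repo_id="your-username/my-protein-model",
    )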

Interface and Operation Issues

Q22: What should I do if the interface loads slowly or crashes?

Answer: Solutions for interface issues:

  1. Browser-related:

    • Try using different browsers (Chrome usually has the best compatibility)
    • Clear browser cache and cookies
    • Disable unnecessary browser extensions
  2. Resource issues:

    • Ensure the system has enough memory
    • Close other resource-intensive programs
    • If running on a remote server, check the server load
  3. Network issues:

    • Ensure the network connection is stable
    • If using through an SSH tunnel, check if the connection is stable
  4. Restart services:

    • Try restarting the Gradio service
    • In extreme cases, restart the server

Q23: Why does my training stop responding midway?

Answer: Possible reasons and solutions for training stopping responding:

  1. Resource exhaustion:

    • Insufficient system memory
    • GPU memory overflow
    • Solution: Reduce batch size, use more efficient training methods, or increase system resources
  2. Process termination:

    • The system's OOM (Out of Memory) killer terminated the process
    • Server timeout policies may terminate long-running processes
    • Solution: Check system logs, use tools like screen or tmux to run in the background, reduce resource usage
  3. Network or interface issues:

    • Browser crashes or network disconnections
    • Solution: Run training via command line, or ensure a stable network connection
  4. Data or code issues:

    • Anomalies or incorrect formats in the dataset causing processing to hang
    • Solution: Check the dataset, and test the process with a small subset of data