ProFactory Frequently Asked Questions (FAQ)
Installation and Environment Configuration Issues
Q1: How to properly install ProFactory?
Answer: You can find the installation steps in the README.md at the root of the repository.
Q2: What should I do if I encounter the error "Could not find a specific dependency" during installation?
Answer: There are several solutions for this situation:
Try installing the problematic dependency individually:
```bash
pip install name_of_the_problematic_library
```
If it is a CUDA-related library, ensure you have installed a PyTorch version compatible with your CUDA version:
```bash
# For example, for CUDA 11.7
pip install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
```
For some special libraries, you may need to install system dependencies first. For example, on Ubuntu:
```bash
sudo apt-get update
sudo apt-get install build-essential
```
Q3: How can I check if my CUDA is installed correctly?
Answer: You can verify if CUDA is installed correctly by the following methods:
Check the CUDA version:
```bash
nvidia-smi
```
Verify if PyTorch can recognize CUDA in Python:
```python
import torch

print(torch.cuda.is_available())      # Should return True
print(torch.cuda.device_count())      # Number of visible GPUs
print(torch.cuda.get_device_name(0))  # Name of the first GPU
```
If PyTorch cannot recognize CUDA, ensure you have installed the matching versions of PyTorch and CUDA.
Hardware and Resource Issues
Q4: What should I do if I encounter a "CUDA out of memory" error during runtime?
Answer: This error indicates that your GPU memory is insufficient. Solutions include:
Reduce the batch size: This is the most direct and effective method. Reduce the batch size in the training configuration by half or more.
Use a smaller model: Choose a pre-trained model with fewer parameters, for example a smaller ESM-2 checkpoint (35M or 150M parameters) instead of a 650M+ one.
Enable gradient accumulation: Increase the `gradient_accumulation_steps` parameter, for example to 2 or 4; this reduces memory usage without shrinking the effective batch size (see the configuration sketch after this list).
Use mixed precision training: Enable the `fp16` option in the training options, which can significantly reduce memory usage.
Reduce the maximum sequence length: If your data allows, decrease the `max_seq_length` parameter.
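For illustration, here is a minimal sketch of these memory-saving options, assuming a Hugging Face `transformers` TrainingArguments-style configuration (ProFactory's own option names may differ):

```python
# Hedged sketch of memory-saving training options; argument names follow
# Hugging Face transformers, which the training stack is assumed to resemble.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ckpt",
    per_device_train_batch_size=4,   # halve this first when you hit OOM
    gradient_accumulation_steps=4,   # effective batch size stays 4 * 4 = 16
    fp16=True,                       # mixed precision significantly cuts activation memory
)
```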
Q5: How can I determine what batch size I should use?
Answer: Determining the appropriate batch size requires balancing memory usage and training effectiveness:
Start small and gradually increase: Begin with smaller values (like 4 or 8) and gradually increase until memory is close to its limit.
Refer to benchmarks: For common protein models, most studies use a batch size of 16-64, but this depends on your GPU memory and sequence length.
Monitor the training process: A larger batch size may make each training iteration more stable but may require a higher learning rate.
Rule of thumb for memory issues: If you encounter memory errors, first try halving the batch size.
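As a rough illustration of "start small and gradually increase", here is a sketch that doubles the batch size until the GPU runs out of memory; `make_batch` is a hypothetical helper that builds a forward-pass input of a given size:

```python
import torch

def find_max_batch_size(model, make_batch, start=4, limit=256):
    # Double the batch size until CUDA raises an out-of-memory error,
    # then back off to the last size that fit.
    bs = start
    while bs <= limit:
        try:
            with torch.no_grad():
                model(**make_batch(bs))   # make_batch is a hypothetical helper
            bs *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return max(start, bs // 2)
    return limit
```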
Dataset Issues
Q6: How do I prepare a custom dataset?
Answer: Preparing a custom dataset requires the following steps:
Format the data: Organize the data into a CSV file containing at least the following columns:
- `sequence`: The protein sequence, using standard amino acid letters
- A label column: Depending on your task type, numerical (regression) or categorical (classification)
Split the data: Prepare training, validation, and test sets, such as `train.csv`, `validation.csv`, and `test.csv`.
Upload to Hugging Face:
- Create a dataset repository on Hugging Face
- Upload your CSV files
- Reference it in ProFactory using the `username/dataset_name` format
Create dataset configuration: The configuration should include the problem type (regression or classification), number of labels, and evaluation metrics.
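A minimal sketch of the formatting and splitting steps with pandas (column and file names follow this FAQ; the sequences are toy examples):

```python
import pandas as pd

df = pd.DataFrame({
    "sequence": ["MKTAYIAKQRQISFVK", "GSHMLEDPVAGLTK"],  # standard amino acid letters
    "label": [1, 0],                                     # classification labels
})

train = df.sample(frac=0.8, random_state=42)   # 80/20 split
validation = df.drop(train.index)
train.to_csv("train.csv", index=False)
validation.to_csv("validation.csv", index=False)
```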
Q7: What should I do if I encounter a format error when importing my dataset?
Answer: Common format issues and their solutions:
Incorrect column names: Ensure the CSV file contains the necessary columns, especially the `sequence` column and the label column.
Sequence format issues:
- Ensure the sequence contains only valid amino acid letters (ACDEFGHIKLMNPQRSTVWY)
- Remove spaces, line breaks, or other illegal characters from the sequence
- Check if the sequence length is within a reasonable range
Encoding issues: Ensure the CSV file is saved with UTF-8 encoding.
CSV delimiter issues: Ensure the file uses the correct delimiter (usually a comma). You can use a text editor to view and correct it.
Handling missing values: Ensure there are no missing values in the data, or handle them appropriately.
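Most of these checks can be automated. A small validation sketch, assuming a `train.csv` of the shape described in Q6:

```python
import pandas as pd

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

df = pd.read_csv("train.csv", encoding="utf-8")
assert "sequence" in df.columns, "missing required `sequence` column"
assert not df.isna().any().any(), "dataset contains missing values"

# Flag sequences containing characters outside the standard amino acid alphabet.
invalid = df[~df["sequence"].apply(lambda s: set(str(s).strip()) <= VALID_AA)]
print(f"{len(invalid)} sequences contain invalid characters")
print("length range:", df["sequence"].str.len().min(), "-", df["sequence"].str.len().max())
```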
Q8: My dataset is large, and the system loads slowly or crashes. What should I do?
Answer: For large datasets, you can:
Reduce the dataset size: If possible, test your method with a subset of the data first.
Increase data loading efficiency:
- Use the `batch_size` parameter to control the amount of data loaded at a time
- Enable data caching to avoid repeated loading
- Preprocess the data to reduce file size (e.g., remove unnecessary columns)
Dataset sharding: Split large datasets into multiple smaller files and process them one by one.
Increase system resources: If possible, increase RAM or use a server with more memory.
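One way to avoid holding a large CSV in memory is streaming mode in the Hugging Face `datasets` library (an assumption about your loading stack; ProFactory may handle loading internally):

```python
from datasets import load_dataset

# Streaming iterates over the file lazily instead of loading it all into RAM.
stream = load_dataset("csv", data_files="train.csv", streaming=True)["train"]
for i, example in enumerate(stream):
    if i >= 3:
        break
    print(example["sequence"][:20])
```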
Training Issues
Q9: How can I recover if the training suddenly interrupts?
Answer: Methods to handle training interruptions:
Check checkpoints: The system periodically saves checkpoints (usually in the `ckpt` directory). You can recover from the most recent one:
- Look for the last saved model file (usually named `checkpoint-X`, where X is the step number)
- Specify the checkpoint path as the starting point in the training options
Use the checkpoint recovery feature: Enable the checkpoint recovery option in the training configuration.
Save checkpoints more frequently: Adjust the save frequency, for example every 500 steps instead of the default 1000, so less progress is lost on the next interruption.
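A hedged sketch of frequent saving and checkpoint resumption, assuming a transformers-style Trainer underneath (option names may differ in ProFactory's interface):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ckpt",
    save_steps=500,        # checkpoint every 500 steps
    save_total_limit=3,    # keep only the most recent checkpoints
)

# After an interruption, resume from the last checkpoint, e.g.:
# trainer.train(resume_from_checkpoint="ckpt/checkpoint-500")
```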
Q10: How can I speed up training if it is very slow?
Answer: Methods to speed up training:
Hardware aspects:
- Use a more powerful GPU
- Use multi-GPU training (if supported)
- Ensure data is stored on an SSD rather than an HDD
Parameter settings:
- Use mixed precision training (enable the `fp16` option)
- Increase the batch size (if memory allows)
- Reduce the maximum sequence length (if the task allows)
- Decrease validation frequency (increase the `eval_steps` parameter)
Model selection:
- Choose a smaller pre-trained model
- Use parameter-efficient fine-tuning methods (like LoRA)
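For the LoRA point, a minimal sketch with the `peft` library; the checkpoint and the `target_modules` names are illustrative assumptions that depend on the backbone:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained("facebook/esm2_t6_8M_UR50D")
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                  target_modules=["query", "value"])   # attention projections in ESM
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # only the small LoRA adapters are trainable
```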
Q11: What does it mean if the loss value does not decrease or if NaN values appear during training?
Answer: This usually indicates that there is a problem with the training:
Reasons for loss not decreasing and solutions:
- Learning rate too high: Try reducing the learning rate, for example, from 5e-5 to 1e-5
- Optimizer issues: Try different optimizers, such as switching from Adam to AdamW
- Initialization issues: Check the model initialization settings
- Data issues: Validate if the training data has outliers or label errors
Reasons for NaN values and solutions:
- Gradient explosion: Add gradient clipping by setting the `max_grad_norm` parameter (sketched after this list)
- Learning rate too high: Significantly reduce the learning rate
- Numerical instability: This may occur with mixed precision training; try disabling the `fp16` option
- Data anomalies: Check if there are extreme values in the input data
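The parameter-level fixes above, sketched as one TrainingArguments-style configuration (argument names follow Hugging Face transformers, an assumption about the stack):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ckpt",
    learning_rate=1e-5,    # lowered from, e.g., 5e-5
    max_grad_norm=1.0,     # gradient clipping against explosions
    fp16=False,            # disable mixed precision if NaNs persist
)
```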
Q12: What is overfitting, and how can it be avoided?
Answer: Overfitting refers to a model performing well on training data but poorly on new data. Methods to avoid overfitting include:
Increase the amount of data: Use more training data or data augmentation techniques.
Regularization methods:
- Add dropout (usually set to 0.1-0.3)
- Use weight decay
- Early stopping: Stop training when the validation performance no longer improves
Simplify the model:
- Use fewer layers or smaller hidden dimensions
- Freeze some layers of the pre-trained model (using the freeze method)
Cross-validation: Use k-fold cross-validation to obtain a more robust model.
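A hedged sketch of early stopping plus weight decay with a transformers-style Trainer (pass `callbacks=callbacks` when constructing the Trainer):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="ckpt",
    eval_strategy="steps",              # `evaluation_strategy` in older transformers
    load_best_model_at_end=True,        # required by the early-stopping callback
    metric_for_best_model="eval_loss",
    weight_decay=0.01,                  # regularization
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]  # stop after 3 stagnant evals
```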
Evaluation Issues
Q13: How do I interpret evaluation metrics? Which metric is the most important?
Answer: Different tasks focus on different metrics:
Classification tasks:
- Accuracy: The proportion of correct predictions, suitable for balanced datasets
- F1 Score: The harmonic mean of precision and recall, suitable for imbalanced datasets
- MCC (Matthews Correlation Coefficient): A comprehensive measure of classification performance, more robust to class imbalance
- AUROC (Area Under the ROC Curve): Measures the model's ability to distinguish between different classes
Regression tasks:
- MSE (Mean Squared Error): The average of the squared differences between predicted and actual values; smaller is better
- RMSE (Root Mean Squared Error): The square root of MSE, in the same units as the original data
- MAE (Mean Absolute Error): The average of the absolute differences between predicted and actual values
- R² (Coefficient of Determination): Measures the proportion of variance explained by the model, the closer to 1 the better
Most important metric: Depends on your specific application needs. For example, in drug screening, you may focus more on true positive rates; for structural prediction, you may focus more on RMSE.
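All of these metrics are available in scikit-learn; a toy example with made-up predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             roc_auc_score, mean_squared_error, r2_score)

# Classification: true labels, hard predictions, and class-1 probabilities
y_true, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
y_score = [0.2, 0.9, 0.4, 0.1, 0.8]

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1      :", f1_score(y_true, y_pred))
print("MCC     :", matthews_corrcoef(y_true, y_pred))
print("AUROC   :", roc_auc_score(y_true, y_score))

# Regression: continuous targets and predictions
y_cont_true, y_cont_pred = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
print("MSE:", mean_squared_error(y_cont_true, y_cont_pred))
print("R² :", r2_score(y_cont_true, y_cont_pred))
```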
Q14: What should I do if the evaluation results are poor?
Answer: Common strategies to improve model performance:
Data quality:
- Check for errors or noise in the data
- Increase the number of training samples
- Ensure the training and test set distributions are similar
Model adjustments:
- Try different pre-trained models
- Adjust hyperparameters like learning rate and batch size
- Use different fine-tuning methods (full parameter fine-tuning, LoRA, etc.)
Feature engineering:
- Add structural information (e.g., using foldseek features)
- Consider sequence characteristics (e.g., hydrophobicity, charge, etc.)
Ensemble methods:
- Train multiple models and combine results
- Use cross-validation to obtain a more robust model
Q15: Why does my model perform much worse on the test set than on the validation set?
Answer: Common reasons for decreased performance on the test set:
Data distribution shift:
- The training, validation, and test set distributions are inconsistent
- The test set contains protein families or features not seen during training
Overfitting:
- The model overfits the validation set because it was used for model selection
- Increasing regularization or reducing the number of training epochs may help
Data leakage:
- Unintentionally leaking test data information into the training process
- Ensure data splitting is done before preprocessing to avoid cross-contamination
Randomness:
- If the test set is small, results may be influenced by randomness
- Try training multiple models with different random seeds and averaging the results
Prediction Issues
Q16: How can I speed up the prediction process?
Answer: Methods to speed up predictions:
Batch prediction: Use batch prediction mode instead of single-sequence prediction, which can utilize the GPU more efficiently.
Reduce computation:
- Use a smaller model or a more efficient fine-tuning method
- Reduce the maximum sequence length (if possible)
Hardware optimization:
- Use a faster GPU or CPU
- Ensure predictions are done on the GPU rather than the CPU
Model optimization:
- Try model quantization (e.g., int8 quantization)
- Exporting to ONNX format may provide faster inference speeds
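A minimal batched GPU-inference sketch with `transformers`; the checkpoint name is a public ESM-2 placeholder rather than your fine-tuned ProFactory model, so its classification head is untrained here:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/esm2_t6_8M_UR50D"        # placeholder; use your fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()                   # predict on the GPU when available

sequences = ["MKTAYIAKQRQISFVK", "GSHMLEDPVAGLTK"] * 16
logits = []
with torch.no_grad():
    for i in range(0, len(sequences), 32):    # batch prediction, batch size 32
        batch = tokenizer(sequences[i:i + 32], padding=True,
                          return_tensors="pt").to(device)
        logits.append(model(**batch).logits.cpu())
logits = torch.cat(logits)
```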
Q17: What could be the reason for the prediction results being significantly different from expectations?
Answer: Possible reasons for prediction discrepancies:
Data mismatch:
- The sequences being predicted differ from the training data distribution
- There are significant differences in sequence length, composition, or structural features
Model issues:
- The model is under-trained or overfitted
- An unsuitable pre-trained model was chosen for the task
Parameter configuration:
- Ensure the parameters used during prediction (like maximum sequence length) are consistent with those used during training
- Check if the correct problem type (classification/regression) is being used
Data preprocessing:
- Ensure the prediction data undergoes the same preprocessing steps as the training data
- Check if the sequence format is correct (standard amino acid letters, no special characters)
Q18: How can I batch predict a large number of sequences?
Answer: Steps for efficient batch prediction:
Prepare the input file:
- Create a CSV file containing all sequences
- The file must include a `sequence` column
- Optionally include an ID or other identifier columns
Use the batch prediction feature:
- Go to the prediction tab
- Select "Batch Prediction" mode
- Upload the sequence file
- Set an appropriate batch size (usually 16-32 is a good balance)
Optimize settings:
- Increasing the batch size can improve throughput (if memory allows)
- Reducing unnecessary feature calculations can speed up processing
Result handling:
- After prediction is complete, the system will generate a CSV file containing the original sequences and prediction results
- You can download this file for further analysis
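The file round trip looks like this in pandas, with `predict_batch` standing in as a hypothetical wrapper around the model call:

```python
import pandas as pd

df = pd.read_csv("sequences.csv")          # must contain a `sequence` column
# predict_batch is a hypothetical helper wrapping the model inference call
df["prediction"] = predict_batch(df["sequence"].tolist())
df.to_csv("predictions.csv", index=False)  # original sequences plus predictions
```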
Model and Result Issues
Q19: Which pre-trained model should I choose?
Answer: Model selection recommendations:
For general tasks:
- ESM-2 is suitable for various protein-related tasks, balancing performance and efficiency
- ProtBERT performs well on certain sequence classification tasks
Considerations:
- Data volume: When data is limited, a smaller model may be better (to avoid overfitting)
- Sequence length: For long sequences, consider models that support longer contexts
- Computational resources: When resources are limited, choose smaller models or parameter-efficient methods
- Task type: Different models have their advantages in different tasks
Recommended strategy: If conditions allow, try several different models and choose the one that performs best on the validation set.
Q20: How do I interpret the loss curve during training?
Answer: Guidelines for interpreting the loss curve:
Ideal curve:
- Both training loss and validation loss decrease steadily
- The two curves eventually stabilize and converge
- The validation loss stabilizes near its lowest point
Common patterns and their meanings:
- Training loss continues to decrease while validation loss increases: Signal of overfitting; consider increasing regularization
- Both losses stagnate at high values: Indicates underfitting; may need a more complex model or longer training
- Curve fluctuates dramatically: The learning rate may be too high; consider lowering it
- Validation loss is lower than training loss: Often a sign that regularization (e.g., dropout or batch normalization) is active only during training, or of a data splitting issue
Adjusting based on the curve:
- If validation loss stops improving early, consider early stopping
- If training loss decreases very slowly, try increasing the learning rate
- If there are sudden jumps in the curve, check for data issues or learning rate scheduling
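If you train with a transformers-style Trainer (an assumption about the stack), both curves can be plotted from `trainer.state.log_history`:

```python
import matplotlib.pyplot as plt

history = trainer.state.log_history   # assumes an already-trained transformers Trainer
train = [(h["step"], h["loss"]) for h in history if "loss" in h]
val = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train), label="train")
plt.plot(*zip(*val), label="validation")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```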
Q21: How do I save and share my model?
Answer: Guidelines for saving and sharing models:
Local saving:
- After training is complete, the model will be automatically saved in the specified output directory
- The complete model includes model weights, configuration files, and tokenizer information
Important files:
- `pytorch_model.bin` (or `model.safetensors` in newer versions): Model weights
- `config.json`: Model configuration
- `special_tokens_map.json` and `tokenizer_config.json`: Tokenizer configuration
Sharing the model:
Hugging Face Hub: The easiest way to share is to upload to the Hub
- Create a model repository
- Upload your model files
- Add model descriptions and usage instructions in the README
Local export: You can also compress the model folder and share it
- Ensure all necessary files are included
- Provide environment requirements and usage instructions
Documentation: Regardless of the sharing method, you should provide:
- Description of the training data
- Model architecture and parameters
- Performance metrics
- Usage examples
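A sketch of local saving followed by a Hub upload; `model` and `tokenizer` are your trained objects, the repo id is a placeholder, and pushing requires a prior `huggingface-cli login`:

```python
# Save locally (weights, config, and tokenizer files end up together).
model.save_pretrained("output/my-protein-model")
tokenizer.save_pretrained("output/my-protein-model")

# Share on the Hugging Face Hub (placeholder repo id).
model.push_to_hub("username/my-protein-model")
tokenizer.push_to_hub("username/my-protein-model")
```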
Interface and Operation Issues
Q22: What should I do if the interface loads slowly or crashes?
Answer: Solutions for interface issues:
Browser-related:
- Try using different browsers (Chrome usually has the best compatibility)
- Clear browser cache and cookies
- Disable unnecessary browser extensions
Resource issues:
- Ensure the system has enough memory
- Close other resource-intensive programs
- If running on a remote server, check the server load
Network issues:
- Ensure the network connection is stable
- If using through an SSH tunnel, check if the connection is stable
Restart services:
- Try restarting the Gradio service
- In extreme cases, restart the server
Q23: Why does my training stop responding midway?
Answer: Possible reasons why training stops responding, and their solutions:
Resource exhaustion:
- Insufficient system memory
- GPU memory overflow
- Solution: Reduce batch size, use more efficient training methods, or increase system resources
Process termination:
- The system's OOM (Out of Memory) killer terminated the process
- Server timeout policies may terminate long-running processes
- Solution: Check system logs, use tools like screen or tmux to run in the background, reduce resource usage
Network or interface issues:
- Browser crashes or network disconnections
- Solution: Run training via command line, or ensure a stable network connection
Data or code issues:
- Anomalies or incorrect formats in the dataset causing processing to hang
- Solution: Check the dataset, and test the process with a small subset of data