# ProFactory Frequently Asked Questions (FAQ)
## Installation and Environment Configuration Issues
### Q1: How do I properly install ProFactory?
**Answer**: The installation steps are in `README.md` in the repository root.
### Q2: What should I do if I encounter the error "Could not find a specific dependency" during installation?
**Answer**: There are several solutions for this situation:
1. Try installing the problematic dependency individually:
   ```bash
   pip install name_of_the_problematic_library
   ```
2. If it is a CUDA-related library, ensure you have installed a PyTorch version compatible with your CUDA version:
   ```bash
   # For example, for CUDA 11.7
   pip install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
   ```
3. For some special libraries, you may need to install system dependencies first. For example, on Ubuntu:
   ```bash
   sudo apt-get update
   sudo apt-get install build-essential
   ```
### Q3: How can I check if my CUDA is installed correctly?
**Answer**: You can verify your CUDA setup as follows:
1. Check the GPU driver. `nvidia-smi` reports the driver version and the highest CUDA version that driver supports:
   ```bash
   nvidia-smi
   ```
2. Verify that PyTorch can recognize CUDA in Python:
   ```python
   import torch

   print(torch.cuda.is_available())  # Should print True
   print(torch.cuda.device_count())  # Number of visible GPUs
   if torch.cuda.is_available():
       print(torch.cuda.get_device_name(0))  # Name of the first GPU
   ```
3. If PyTorch cannot recognize CUDA, ensure the installed PyTorch build matches a CUDA version that your driver supports.
## Hardware and Resource Issues
### Q4: What should I do if I encounter a "CUDA out of memory" error during runtime?
**Answer**: This error indicates that your GPU memory is insufficient. Solutions include:
1. **Reduce the batch size**: This is the most direct and effective method. Reduce the batch size in the training configuration by half or more.
2. **Use a smaller model**: Choose a pre-trained model with fewer parameters, such as one of the smaller ESM-2 checkpoints (for example, the 35M or 150M parameter variants).
3. **Enable gradient accumulation**: Increase the `gradient_accumulation_steps` parameter (for example, to 2 or 4) and reduce the per-step batch size by the same factor. Each step then needs less memory while the effective batch size stays the same.
4. **Use mixed precision training**: Enable the `fp16` option in the training options, which can significantly reduce memory usage.
5. **Reduce the maximum sequence length**: If your data allows, you can decrease the `max_seq_length` parameter.
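
The relationship between items 1 and 3 is simple arithmetic: the effective batch size is the per-step batch size multiplied by `gradient_accumulation_steps`. The sketch below illustrates it; `accumulation_plan` is a hypothetical helper, not a ProFactory function, and it assumes you already know roughly how many samples fit in memory per step.

```python
def accumulation_plan(target_batch_size: int, max_batch_in_memory: int) -> tuple[int, int]:
    """Return (per_step_batch_size, gradient_accumulation_steps).

    Picks the largest per-step batch that fits in memory and divides the
    target evenly, so per_step * steps == target_batch_size.
    """
    if target_batch_size < 1 or max_batch_in_memory < 1:
        raise ValueError("batch sizes must be positive")
    per_step = min(target_batch_size, max_batch_in_memory)
    # Shrink until the target is an exact multiple of the per-step batch.
    while target_batch_size % per_step != 0:
        per_step -= 1
    return per_step, target_batch_size // per_step


per_step, steps = accumulation_plan(target_batch_size=32, max_batch_in_memory=8)
print(per_step, steps)  # 8 4
```

Here a target batch size of 32 that does not fit in memory is replaced by 4 accumulated steps of 8 samples each.
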
### Q5: How can I determine what batch size I should use?
**Answer**: Determining the appropriate batch size requires balancing memory usage and training effectiveness:
1. **Start small and gradually increase**: Begin with smaller values (like 4 or 8) and gradually increase until memory is close to its limit.
2. **Refer to benchmarks**: For common protein models, most studies use a batch size of 16-64, but this depends on your GPU memory and sequence length.
3. **Monitor the training process**: Larger batches give less noisy gradient estimates, so each training iteration tends to be more stable, but they may require a higher learning rate.
4. **Rule of thumb for memory issues**: If you encounter memory errors, first try halving the batch size.
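
The "start small and gradually increase" strategy can be automated. The following is a minimal sketch, not ProFactory's own logic: `try_batch` is a hypothetical callable that runs one training step at a given batch size, and the stand-in below raises `MemoryError` to simulate running out of memory. A real GPU step would raise a PyTorch out-of-memory error instead, so the `except` clause would need to be adapted.

```python
from typing import Callable


def find_max_batch_size(try_batch: Callable[[int], None], start: int = 4, limit: int = 256) -> int:
    """Double the batch size until `try_batch` raises MemoryError.

    Returns the largest size that succeeded, or 0 if even `start` failed.
    """
    best = 0
    size = start
    while size <= limit:
        try:
            try_batch(size)
        except MemoryError:
            break
        best = size
        size *= 2
    return best


def fake_step(batch_size: int) -> None:
    # Stand-in for a real training step: pretend anything above 24 does not fit.
    if batch_size > 24:
        raise MemoryError("simulated out-of-memory")


print(find_max_batch_size(fake_step))  # 16
```

In practice it is safer to run with a value somewhat below the maximum found, because memory use varies with sequence length from batch to batch.
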
## Dataset Issues
### Q6: How do I prepare a custom dataset?
**Answer**: Preparing a custom dataset requires the following steps:
1. **Format the data**: The data should be organized into a CSV file, containing at least the following columns:
   - `sequence`: The protein sequence, represented using standard amino acid letters
   - Label column: Depending on your task type, this can be numerical (regression) or categorical (classification)
2. **Split the data**: Prepare training, validation, and test sets, such as `train.csv`, `validation.csv`, and `test.csv`.
3. **Upload to Hugging Face**:
   - Create a dataset repository on Hugging Face
   - Upload your CSV files
   - Reference it in ProFactory using the `username/dataset_name` format
4. **Create dataset configuration**: The configuration should include the problem type (regression or classification), number of labels, and evaluation metrics.
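
Steps 1 and 2 can be scripted with the Python standard library. The sketch below is illustrative only: the column name `label`, the 80/10/10 ratio, and the helper names are assumptions, and a purely random split is used for simplicity (it does not keep related proteins in the same split).

```python
import csv
import random
import tempfile
from pathlib import Path


def split_rows(rows, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle rows reproducibly and split them into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]


def write_csv(path, rows, fieldnames=("sequence", "label")):
    """Write a list of dict rows to a UTF-8 CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(rows)


# Toy data: ten made-up sequences with binary labels.
rows = [{"sequence": "MKT" + "A" * i, "label": i % 2} for i in range(10)]
train, valid, test = split_rows(rows)

# Written to a throwaway directory here; use your own output folder in practice.
with tempfile.TemporaryDirectory() as out_dir:
    for name, part in (("train", train), ("validation", valid), ("test", test)):
        write_csv(Path(out_dir) / f"{name}.csv", part)
    print(sorted(p.name for p in Path(out_dir).iterdir()))
```

The fixed seed makes the split reproducible, which helps when comparing models trained at different times.
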
### Q7: What should I do if I encounter a format error when importing my dataset?
**Answer**: Common format issues and their solutions:
1. **Incorrect column names**: Ensure the CSV file contains the necessary columns, especially the `sequence` column and label column.
2. **Sequence format issues**:
   - Ensure the sequence contains only valid amino acid letters (ACDEFGHIKLMNPQRSTVWY)
   - Remove spaces, line breaks, or other illegal characters from the sequence
   - Check that the sequence length is within a reasonable range (for example, relative to the maximum sequence length used for training)
3. **Encoding issues**: Ensure the CSV file is saved with UTF-8 encoding.
4. **CSV delimiter issues**: Ensure the file uses the correct delimiter (usually a comma). You can use a text editor to view and correct it.
5. **Handling missing values**: Ensure there are no missing values in the data, or handle them appropriately.
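
The sequence checks in item 2 are easy to run before uploading. A minimal sketch, using the 20-letter alphabet listed above; the function names are hypothetical and not part of ProFactory.

```python
VALID_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Remove spaces, tabs and line breaks from a raw sequence string."""
    return "".join(raw.split())


def invalid_characters(sequence: str) -> set[str]:
    """Return the characters that are not standard amino acid letters."""
    return set(sequence) - VALID_AMINO_ACIDS


seq = clean_sequence(" MKT AYI\nAKQR ")
print(seq)                      # MKTAYIAKQR
print(invalid_characters(seq))  # set()
print(sorted(invalid_characters("MKXB*")))
```

Rows for which `invalid_characters` returns a non-empty set can be reported or dropped before the file is imported.
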
### Q8: My dataset is large, and the system loads slowly or crashes. What should I do?
**Answer**: For large datasets, you can:
1. **Reduce the dataset size**: If possible, test your method with a subset of the data first.
2. **Increase data loading efficiency**:
   - Use the `batch_size` parameter to control the amount of data loaded at a time
   - Enable data caching to avoid repeated loading
   - Preprocess the data to reduce file size (e.g., remove unnecessary columns)
3. **Dataset sharding**: Split large datasets into multiple smaller files and process them one by one.
4. **Increase system resources**: If possible, increase RAM or use a server with more memory.
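
When preprocessing a large CSV yourself, reading it in fixed-size chunks keeps memory use flat. The sketch below shows one way to do this with the standard library; `iter_csv_chunks` is a hypothetical helper, and the in-memory `StringIO` object stands in for a large file on disk.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_csv_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `chunk_size` rows without loading the whole file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# In-memory stand-in for a large file; a real call would pass open(path, newline="").
demo = io.StringIO("sequence,label\nMKT,1\nAYI,0\nAKQ,1\nRQI,0\nSFV,1\n")
for chunk in iter_csv_chunks(demo, chunk_size=2):
    print(len(chunk), [row["sequence"] for row in chunk])
```

The same loop can write each chunk to its own file, which gives the sharded layout described in item 3.
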
## Training Issues
### Q9: How can I recover if training is suddenly interrupted?
**Answer**: Methods to handle training interruptions:
1. **Check checkpoints**: The system periodically saves checkpoints (usually in the `ckpt` directory). You can recover from the most recent checkpoint:
   - Look for the last saved model file (usually named `checkpoint-X`, where X is the step number)
   - Specify the checkpoint path as the starting point in the training options
2. **Use the checkpoint recovery feature**: Enable the checkpoint recovery option in the training configuration.
3. **Save checkpoints more frequently**: Adjust the frequency of saving checkpoints, for example, save every 500 steps instead of the default every 1000 steps.
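
Finding the most recent checkpoint means comparing step numbers numerically, since an alphabetical sort would place `checkpoint-9500` after `checkpoint-10000`. A minimal sketch, assuming the `checkpoint-X` naming described above; `latest_checkpoint` is a hypothetical helper and the demo runs on a temporary directory.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the `checkpoint-X` entry with the largest step X, or None if there is none."""
    best_step, best_path = -1, None
    for child in ckpt_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo on a throwaway directory with four fake checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 10000, 9500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(Path(tmp)).name)  # checkpoint-10000
```
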
### Q10: How can I speed up training if it is very slow?
**Answer**: Methods to speed up training:
1. **Hardware aspects**:
   - Use a more powerful GPU
   - Use multi-GPU training (if supported)
   - Ensure data is stored on an SSD rather than an HDD
2. **Parameter settings**:
   - Use mixed precision training (enable the `fp16` option)
   - Increase the batch size (if memory allows)
   - Reduce the maximum sequence length (if the task allows)
   - Decrease validation frequency (the `eval_steps` parameter)
3. **Model selection**:
   - Choose a smaller pre-trained model
   - Use parameter-efficient fine-tuning methods (like LoRA)
### Q11: What does it mean if the loss value does not decrease or if NaN values appear during training?
**Answer**: This usually indicates that there is a problem with the training:
1. **Reasons for loss not decreasing and solutions**:
   - **Learning rate too high**: Try reducing the learning rate, for example, from 5e-5 to 1e-5
   - **Optimizer issues**: Try different optimizers, such as switching from Adam to AdamW
   - **Initialization issues**: Check the model initialization settings
   - **Data issues**: Check whether the training data has outliers or label errors
2. **Reasons for NaN values and solutions**:
   - **Gradient explosion**: Add gradient clipping by setting the `max_grad_norm` parameter
   - **Learning rate too high**: Significantly reduce the learning rate
   - **Numerical instability**: This may occur when using mixed precision training; try disabling the `fp16` option
   - **Data anomalies**: Check if there are extreme values in the input data
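
To make the effect of `max_grad_norm` concrete, the sketch below clips a gradient vector by its global L2 norm using plain Python lists. It illustrates the idea only: training frameworks apply the same rescaling to tensors (PyTorch provides `torch.nn.utils.clip_grad_norm_` for this), and `clip_by_global_norm` is a hypothetical name.

```python
import math


def clip_by_global_norm(grads: list[float], max_grad_norm: float) -> list[float]:
    """Rescale gradients so their L2 norm does not exceed `max_grad_norm`."""
    if any(math.isnan(g) or math.isinf(g) for g in grads):
        raise ValueError("non-finite gradient: lower the learning rate or inspect the data")
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_grad_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_grad_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0)
print(clipped, math.hypot(*clipped))  # direction kept, norm reduced from 5 to about 1
```

Clipping keeps the direction of the update and only limits its size, which is why it helps with exploding gradients without otherwise changing training.
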
### Q12: What is overfitting, and how can it be avoided?
**Answer**: Overfitting refers to a model performing well on training data but poorly on new data. Methods to avoid overfitting include:
1. **Increase the amount of data**: Use more training data or data augmentation techniques.
2. **Regularization methods**:
   - Add dropout (usually set to 0.1-0.3)
   - Use weight decay
   - Early stopping: Stop training when the validation performance no longer improves
3. **Simplify the model**:
   - Use fewer layers or smaller hidden dimensions
   - Freeze some layers of the pre-trained model (using the freeze method)
4. **Cross-validation**: Use k-fold cross-validation to obtain a more robust model.
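
Early stopping is usually implemented with a "patience" counter: training stops once the validation loss has failed to improve for a set number of evaluations. The class below is a generic sketch of that rule, not ProFactory's implementation; the names `EarlyStopping`, `patience`, and `min_delta` are illustrative.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.90, 0.70, 0.65, 0.66, 0.68, 0.64]):
    if stopper.should_stop(loss):
        print(f"stop after epoch {epoch}, best validation loss {stopper.best}")
        break
```

With a patience of 2, this run stops at epoch 4 and never sees the later improvement to 0.64, which shows the trade-off: a small patience saves time but can stop too early on a noisy curve.
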
## Evaluation Issues
### Q13: How do I interpret evaluation metrics? Which metric is the most important?
**Answer**: Different tasks focus on different metrics:
1. **Classification tasks**:
   - **Accuracy**: The proportion of correct predictions, suitable for balanced datasets
   - **F1 Score**: The harmonic mean of precision and recall, suitable for imbalanced datasets
   - **MCC (Matthews Correlation Coefficient)**: A comprehensive measure of classification performance, more robust to class imbalance
   - **AUROC (Area Under the ROC Curve)**: Measures the model's ability to distinguish between different classes
2. **Regression tasks**:
   - **MSE (Mean Squared Error)**: The mean of the squared differences between predicted and actual values, the smaller the better
   - **RMSE (Root Mean Squared Error)**: The square root of MSE, in the same units as the original data
   - **MAE (Mean Absolute Error)**: The average of the absolute differences between predicted and actual values
   - **R² (Coefficient of Determination)**: Measures the proportion of variance explained by the model, the closer to 1 the better
3. **Most important metric**: Depends on your specific application needs. For example, in drug screening, you may focus more on true positive rates; for structural prediction, you may focus more on RMSE.
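
The definitions above translate directly into code. The sketch below computes the regression metrics and, for the binary case, accuracy, F1, and MCC from the confusion counts. It uses only the standard library and is meant to show the formulas, not to replace the metric implementations ProFactory uses; the function names are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R² for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # zero if all targets are equal
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": 1.0 - ss_res / ss_tot}


def binary_metrics(y_true, y_pred):
    """Return accuracy, F1 and MCC for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    f1_denom = 2 * tp + fp + fn
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "f1": 2 * tp / f1_denom if f1_denom else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 6]))
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In the regression example a single error of 2 gives an MSE of 1.0 but an MAE of 0.5, which illustrates how MSE and RMSE penalize large individual errors more heavily than MAE.
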
### Q14: What should I do if the evaluation results are poor?
**Answer**: Common strategies to improve model performance:
1. **Data quality**:
   - Check for errors or noise in the data
   - Increase the number of training samples
   - Ensure the training and test set distributions are similar
2. **Model adjustments**:
   - Try different pre-trained models
   - Adjust hyperparameters like learning rate and batch size
   - Use different fine-tuning methods (full parameter fine-tuning, LoRA, etc.)
3. **Feature engineering**:
   - Add structural information (e.g., using foldseek features)
   - Consider sequence characteristics (e.g., hydrophobicity, charge, etc.)
4. **Ensemble methods**:
   - Train multiple models and combine results
   - Use cross-validation to obtain a more robust model
### Q15: Why does my model perform much worse on the test set than on the validation set?
**Answer**: Common reasons for decreased performance on the test set:
1. **Data distribution shift**:
   - The training, validation, and test set distributions are inconsistent
   - The test set contains protein families or features not seen during training
2. **Overfitting**:
   - The model overfits the validation set because it was used for model selection
   - Increasing regularization or reducing the number of training epochs may help
3. **Data leakage**:
   - Test data information unintentionally leaked into the training process
   - Split the data before any preprocessing that computes statistics from it, so that nothing derived from the test set reaches training
4. **Randomness**:
   - If the test set is small, results may be influenced by randomness
   - Try training multiple models with different random seeds and averaging the results
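
Two of the checks above are quick to script: looking for sequences shared between splits, and summarizing scores from runs with different seeds. The sketch below is illustrative, with hypothetical function names and made-up scores. Note that the overlap check only finds exact duplicates; it does not detect closely related sequences from the same family.

```python
from statistics import mean, stdev


def overlapping_sequences(train_seqs, test_seqs):
    """Return the sequences that appear in both splits (exact string match)."""
    return set(train_seqs) & set(test_seqs)


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of a metric over several seeds."""
    return mean(scores), stdev(scores)


print(overlapping_sequences(["MKT", "AYI", "AKQ"], ["AYI", "RQI"]))  # {'AYI'}

# Made-up accuracy values from three runs with different random seeds.
avg, spread = summarize_runs([0.80, 0.82, 0.84])
print(f"accuracy {avg:.3f} +/- {spread:.3f}")
```

If the gap between validation and test scores is smaller than the spread across seeds, it may be noise rather than a real difference.
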
## Prediction Issues
### Q16: How can I speed up the prediction process?
**Answer**: Methods to speed up predictions:
1. **Batch prediction**: Use batch prediction mode instead of single-sequence prediction, which can utilize the GPU more efficiently.
2. **Reduce computation**:
   - Use a smaller model or a more efficient fine-tuning method
   - Reduce the maximum sequence length (if possible)
3. **Hardware optimization**:
   - Use a faster GPU or CPU
   - Ensure predictions are done on the GPU rather than the CPU
4. **Model optimization**:
   - Try model quantization (e.g., int8 quantization)
   - Exporting to ONNX format may provide faster inference speeds
### Q17: Why might the prediction results differ significantly from what I expect?
**Answer**: Possible reasons for prediction discrepancies:
1. **Data mismatch**:
   - The sequences being predicted differ from the training data distribution
   - There are significant differences in sequence length, composition, or structural features
2. **Model issues**:
   - The model is under-trained or overfitted
   - An unsuitable pre-trained model was chosen for the task
3. **Parameter configuration**:
   - Ensure the parameters used during prediction (like maximum sequence length) are consistent with those used during training
   - Check if the correct problem type (classification/regression) is being used
4. **Data preprocessing**:
   - Ensure the prediction data undergoes the same preprocessing steps as the training data
   - Check if the sequence format is correct (standard amino acid letters, no special characters)
### Q18: How can I batch predict a large number of sequences?
**Answer**: Steps for efficient batch prediction:
1. **Prepare the input file**:
   - Create a CSV file containing all sequences
   - The file must include a `sequence` column
   - Optionally include an ID or other identifier columns
2. **Use the batch prediction feature**:
   - Go to the prediction tab
   - Select "Batch Prediction" mode
   - Upload the sequence file
   - Set an appropriate batch size (usually 16-32 is a good balance)
3. **Optimize settings**:
   - Increasing the batch size can improve throughput (if memory allows)
   - Reducing unnecessary feature calculations can speed up processing
4. **Result handling**:
   - After prediction is complete, the system will generate a CSV file containing the original sequences and prediction results
   - You can download this file for further analysis
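
For readers who want to see what batch prediction does conceptually, the sketch below groups sequences into fixed-size batches, calls a prediction function once per batch, and writes a CSV of sequences and results. It is not ProFactory's implementation: `predict_to_csv` and the output column name `prediction` are assumptions, and `fake_predict` is a stand-in that simply returns sequence lengths.

```python
import csv
import io
from typing import Callable, Iterator, Sequence, TextIO


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_to_csv(sequences: Sequence[str], predict_batch: Callable, handle: TextIO,
                   batch_size: int = 16) -> None:
    """Run `predict_batch` on each batch and write sequence/prediction rows."""
    writer = csv.writer(handle)
    writer.writerow(["sequence", "prediction"])
    for batch in batched(sequences, batch_size):
        for seq, pred in zip(batch, predict_batch(batch)):
            writer.writerow([seq, pred])


def fake_predict(batch):
    # Hypothetical stand-in for a model: "predicts" the length of each sequence.
    return [len(seq) for seq in batch]


out = io.StringIO()
predict_to_csv(["MKT", "AYIAK", "QR"], fake_predict, out, batch_size=2)
print(out.getvalue())
```

Calling the model once per batch rather than once per sequence is what lets the GPU process several sequences in parallel.
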
## Model and Result Issues
### Q19: Which pre-trained model should I choose?
**Answer**: Model selection recommendations:
1. **For general tasks**:
   - ESM-2 is suitable for various protein-related tasks, balancing performance and efficiency
   - ProtBERT performs well on certain sequence classification tasks
2. **Considerations**:
   - **Data volume**: When data is limited, a smaller model may be better (to avoid overfitting)
   - **Sequence length**: For long sequences, consider models that support longer contexts
   - **Computational resources**: When resources are limited, choose smaller models or parameter-efficient methods
   - **Task type**: Different models have their advantages in different tasks
3. **Recommended strategy**: If conditions allow, try several different models and choose the one that performs best on the validation set.
### Q20: How do I interpret the loss curve during training?
**Answer**: Guidelines for interpreting the loss curve:
1. **Ideal curve**:
   - Both training loss and validation loss decrease steadily
   - The two curves eventually stabilize and converge
   - The validation loss stabilizes near its lowest point
2. **Common patterns and their meanings**:
   - **Training loss continues to decrease while validation loss increases**: Signal of overfitting; consider increasing regularization
   - **Both losses stagnate at high values**: Indicates underfitting; may need a more complex model or longer training
   - **Curve fluctuates dramatically**: The learning rate may be too high; consider lowering it
   - **Validation loss is lower than training loss**: This may indicate a data splitting issue, or it may simply reflect layers such as dropout or batch normalization behaving differently during training and evaluation
3. **Adjusting based on the curve**:
   - If validation loss stops improving early, consider early stopping
   - If training loss decreases very slowly, try increasing the learning rate
   - If there are sudden jumps in the curve, check for data issues or learning rate scheduling
### Q21: How do I save and share my model?
**Answer**: Guidelines for saving and sharing models:
1. **Local saving**:
   - After training is complete, the model will be automatically saved in the specified output directory
   - The complete model includes model weights, configuration files, and tokenizer information
2. **Important files**:
   - `pytorch_model.bin`: Model weights
   - `config.json`: Model configuration
   - `special_tokens_map.json` and `tokenizer_config.json`: Tokenizer configuration
3. **Sharing the model**:
   - **Hugging Face Hub**: The easiest way is to upload to Hugging Face
     - Create a model repository
     - Upload your model files
     - Add model descriptions and usage instructions in the README
   - **Local export**: You can also compress the model folder and share it
     - Ensure all necessary files are included
     - Provide environment requirements and usage instructions
4. **Documentation**: Regardless of the sharing method, you should provide:
   - Description of the training data
   - Model architecture and parameters
   - Performance metrics
   - Usage examples
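
Before compressing or uploading a model folder, it can help to confirm that the files listed under "Important files" are actually present. A minimal sketch, assuming exactly those four file names; `missing_model_files` is a hypothetical helper, and your output directory may contain additional or differently named files depending on how the model was saved.

```python
import tempfile
from pathlib import Path

# File names taken from the "Important files" list above.
EXPECTED_FILES = (
    "pytorch_model.bin",
    "config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
)


def missing_model_files(model_dir: Path, expected=EXPECTED_FILES) -> list[str]:
    """Return the expected file names that are not present in `model_dir`."""
    return [name for name in expected if not (model_dir / name).is_file()]


# Demo on a throwaway directory that contains only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "pytorch_model.bin").write_bytes(b"")
    print(missing_model_files(Path(tmp)))
```

An empty list means all expected files were found.
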
## Interface and Operation Issues
### Q22: What should I do if the interface loads slowly or crashes?
**Answer**: Solutions for interface issues:
1. **Browser-related**:
   - Try using different browsers (Chrome usually has the best compatibility)
   - Clear browser cache and cookies
   - Disable unnecessary browser extensions
2. **Resource issues**:
   - Ensure the system has enough memory
   - Close other resource-intensive programs
   - If running on a remote server, check the server load
3. **Network issues**:
   - Ensure the network connection is stable
   - If you are connecting through an SSH tunnel, check that the tunnel is stable
4. **Restart services**:
   - Try restarting the Gradio service
   - In extreme cases, restart the server
### Q23: Why does my training stop responding midway?
**Answer**: Possible reasons why training stops responding, and their solutions:
1. **Resource exhaustion**:
   - Insufficient system memory
   - GPU memory overflow
   - Solution: Reduce batch size, use more efficient training methods, or increase system resources
2. **Process termination**:
   - The system's OOM (Out of Memory) killer terminated the process
   - Server timeout policies may terminate long-running processes
   - Solution: Check system logs, use tools like `screen` or `tmux` to keep the process running after you disconnect, and reduce resource usage
3. **Network or interface issues**:
   - Browser crashes or network disconnections
   - Solution: Run training via command line, or ensure a stable network connection
4. **Data or code issues**:
   - Anomalies or incorrect formats in the dataset causing processing to hang
   - Solution: Check the dataset, and test the process with a small subset of data