# ProFactory Frequently Asked Questions (FAQ)
## Installation and Environment Configuration Issues
### Q1: How to properly install ProFactory?
**Answer**: You can find the installation steps in the README.md at the root of the repository.
### Q2: What should I do if I encounter the error "Could not find a specific dependency" during installation?
**Answer**: There are several solutions for this situation:
1. Try installing the problematic dependency individually:
```bash
pip install name_of_the_problematic_library
```
2. If it is a CUDA-related library, ensure you have installed a PyTorch version compatible with your CUDA version:
```bash
# For example, for CUDA 11.7
pip install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
```
3. For some special libraries, you may need to install system dependencies first. For example, on Ubuntu:
```bash
sudo apt-get update
sudo apt-get install build-essential
```
### Q3: How can I check if my CUDA is installed correctly?
**Answer**: You can verify if CUDA is installed correctly by the following methods:
1. Check the CUDA version:
```bash
nvidia-smi
```
2. Verify if PyTorch can recognize CUDA in Python:
```python
import torch
print(torch.cuda.is_available()) # Should return True
print(torch.cuda.device_count()) # Displays the number of GPUs
print(torch.cuda.get_device_name(0)) # Displays the GPU name
```
3. If PyTorch cannot recognize CUDA, ensure you have installed the matching versions of PyTorch and CUDA.
## Hardware and Resource Issues
### Q4: What should I do if I encounter a "CUDA out of memory" error during runtime?
**Answer**: This error indicates that your GPU memory is insufficient. Solutions include:
1. **Reduce the batch size**: This is the most direct and effective method. Reduce the batch size in the training configuration by half or more.
2. **Use a smaller model**: Choose a pre-trained model with fewer parameters, such as switching from a large ESM-2 checkpoint to one of its smaller variants.
3. **Enable gradient accumulation**: Increase the `gradient_accumulation_steps` parameter (for example, to 2 or 4) while lowering the per-step batch size; this reduces memory usage without decreasing the effective batch size.
4. **Use mixed precision training**: Enable the `fp16` option in the training options, which can significantly reduce memory usage.
5. **Reduce the maximum sequence length**: If your data allows, you can decrease the `max_seq_length` parameter.
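Points 1 and 3 interact: gradient accumulation lets you shrink the per-step batch while keeping the effective batch size constant. A minimal sketch of that arithmetic (the function name is ours, for illustration only):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Gradients are accumulated over several small forward/backward
    passes before each optimizer step, so the effective batch size is
    the product of all three factors."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus

# Halving the per-device batch size while doubling the accumulation
# steps keeps the effective batch size (and training dynamics) the
# same, but roughly halves peak activation memory.
assert effective_batch_size(16, 1) == effective_batch_size(8, 2) == 16
```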
### Q5: How can I determine what batch size I should use?
**Answer**: Determining the appropriate batch size requires balancing memory usage and training effectiveness:
1. **Start small and gradually increase**: Begin with smaller values (like 4 or 8) and gradually increase until memory is close to its limit.
2. **Refer to benchmarks**: For common protein models, most studies use a batch size of 16-64, but this depends on your GPU memory and sequence length.
3. **Monitor the training process**: A larger batch size may make each training iteration more stable but may require a higher learning rate.
4. **Rule of thumb for memory issues**: If you encounter memory errors, first try halving the batch size.
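The "start small and gradually increase" strategy from point 1 can be automated with a doubling search. A sketch, assuming you supply a predicate that runs one training step and returns False on an out-of-memory error (both names below are hypothetical):

```python
def largest_fitting_batch_size(fits, start=4, limit=1024):
    """Double the batch size until `fits` fails, then return the last
    size that worked. `fits` is a callable you provide, e.g. one
    training step wrapped in a try/except for CUDA OOM errors."""
    best = None
    size = start
    while size <= limit and fits(size):
        best = size
        size *= 2
    return best

# Toy predicate standing in for "one training step succeeds":
print(largest_fitting_batch_size(lambda b: b <= 48))  # 32
```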
## Dataset Issues
### Q6: How do I prepare a custom dataset?
**Answer**: Preparing a custom dataset requires the following steps:
1. **Format the data**: The data should be organized into a CSV file, containing at least the following columns:
- `sequence`: The protein sequence, represented using standard amino acid letters
- Label column: Depending on your task type, this can be numerical (regression) or categorical (classification)
2. **Split the data**: Prepare training, validation, and test sets, such as `train.csv`, `validation.csv`, and `test.csv`.
3. **Upload to Hugging Face**:
- Create a dataset repository on Hugging Face
- Upload your CSV file
- Reference it in ProFactory using the `username/dataset_name` format
4. **Create dataset configuration**: The configuration should include the problem type (regression or classification), number of labels, and evaluation metrics.
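Steps 1 and 2 can be sketched with the standard library alone. The records below are toy data; the column and file names follow the conventions described above:

```python
import csv
import random

# Toy (sequence, label) records standing in for your real data.
records = [("MKTAYIAK", 1), ("GAVLIPFM", 0), ("WYSTCNQH", 1),
           ("DERKHAGP", 0), ("MALWMRLL", 1), ("VLSPADKT", 0)]

random.seed(42)
random.shuffle(records)  # shuffle before splitting

# An 80/10/10-style split (here 4/1/1 because the toy set is tiny).
splits = {"train.csv": records[:4],
          "validation.csv": records[4:5],
          "test.csv": records[5:]}

for filename, rows in splits.items():
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sequence", "label"])  # required column names
        writer.writerows(rows)
```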
### Q7: What should I do if I encounter a format error when importing my dataset?
**Answer**: Common format issues and their solutions:
1. **Incorrect column names**: Ensure the CSV file contains the necessary columns, especially the `sequence` column and label column.
2. **Sequence format issues**:
- Ensure the sequence contains only valid amino acid letters (ACDEFGHIKLMNPQRSTVWY)
- Remove spaces, line breaks, or other illegal characters from the sequence
- Check if the sequence length is within a reasonable range
3. **Encoding issues**: Ensure the CSV file is saved with UTF-8 encoding.
4. **CSV delimiter issues**: Ensure the file uses the correct delimiter (usually a comma). You can use a text editor to view and correct it.
5. **Handling missing values**: Ensure there are no missing values in the data, or handle them appropriately.
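Points 1 and 2 amount to a small validation pass over each sequence. A sketch (the helper name and the 1024-residue default are our own choices, not ProFactory settings):

```python
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def clean_sequence(seq, max_length=1024):
    """Strip whitespace/line breaks, uppercase, and reject sequences
    containing characters outside the 20 standard amino acid letters."""
    cleaned = "".join(seq.split()).upper()
    invalid = set(cleaned) - VALID_AMINO_ACIDS
    if invalid:
        raise ValueError(f"invalid characters: {sorted(invalid)}")
    if not 0 < len(cleaned) <= max_length:
        raise ValueError(f"unreasonable length: {len(cleaned)}")
    return cleaned

print(clean_sequence("mkta yiak\n"))  # MKTAYIAK
```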
### Q8: My dataset is large, and the system loads slowly or crashes. What should I do?
**Answer**: For large datasets, you can:
1. **Reduce the dataset size**: If possible, test your method with a subset of the data first.
2. **Increase data loading efficiency**:
- Use the `batch_size` parameter to control the amount of data loaded at a time
- Enable data caching to avoid repeated loading
- Preprocess the data to reduce file size (e.g., remove unnecessary columns)
3. **Dataset sharding**: Split large datasets into multiple smaller files and process them one by one.
4. **Increase system resources**: If possible, increase RAM or use a server with more memory.
## Training Issues
### Q9: How can I recover if training is suddenly interrupted?
**Answer**: Methods to handle training interruptions:
1. **Check checkpoints**: The system periodically saves checkpoints (usually in the `ckpt` directory). You can recover from the most recent checkpoint:
- Look for the last saved model file (usually named `checkpoint-X`, where X is the step number)
- Specify the checkpoint path as the starting point in the training options
2. **Use the checkpoint recovery feature**: Enable the checkpoint recovery option in the training configuration.
3. **Save checkpoints more frequently**: Adjust the frequency of saving checkpoints, for example, save every 500 steps instead of the default every 1000 steps.
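Finding the most recent checkpoint from point 1 is a simple directory scan; this sketch assumes only the `checkpoint-X` naming convention mentioned above:

```python
import re

def latest_checkpoint(dirnames):
    """Return the `checkpoint-X` name with the highest step number X,
    or None if no checkpoint directories are present."""
    steps = []
    for name in dirnames:
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m:
            steps.append((int(m.group(1)), name))
    return max(steps)[1] if steps else None

# In practice the list would come from os.listdir("ckpt").
print(latest_checkpoint(["checkpoint-500", "checkpoint-1500", "logs"]))
# checkpoint-1500
```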
### Q10: How can I speed up training if it is very slow?
**Answer**: Methods to speed up training:
1. **Hardware aspects**:
- Use a more powerful GPU
- Use multi-GPU training (if supported)
- Ensure data is stored on an SSD rather than an HDD
2. **Parameter settings**:
- Use mixed precision training (enable the fp16 option)
- Increase the batch size (if memory allows)
- Reduce the maximum sequence length (if the task allows)
- Decrease validation frequency (the `eval_steps` parameter)
3. **Model selection**:
- Choose a smaller pre-trained model
- Use parameter-efficient fine-tuning methods (like LoRA)
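To see why LoRA is parameter-efficient: instead of updating a full `d_in × d_out` weight matrix, it trains two low-rank factors. A back-of-the-envelope sketch (the helper is illustrative; 1280 is the hidden dimension of the 650M-parameter ESM-2 model):

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA replaces a full d_in x d_out weight update with two
    low-rank matrices of shapes (d_in, rank) and (rank, d_out)."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return lora, lora / full

params, fraction = lora_trainable_params(1280, 1280, rank=8)
print(params, f"{fraction:.2%}")  # 20480 1.25%
```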
### Q11: What does it mean if the loss value does not decrease or if NaN values appear during training?
**Answer**: This usually indicates that there is a problem with the training:
1. **Reasons for loss not decreasing and solutions**:
- **Learning rate too high**: Try reducing the learning rate, for example, from 5e-5 to 1e-5
- **Optimizer issues**: Try different optimizers, such as switching from Adam to AdamW
- **Initialization issues**: Check the model initialization settings
- **Data issues**: Validate if the training data has outliers or label errors
2. **Reasons for NaN values and solutions**:
- **Gradient explosion**: Add gradient clipping, set the `max_grad_norm` parameter
- **Learning rate too high**: Significantly reduce the learning rate
- **Numerical instability**: This may occur when using mixed precision training; try disabling the fp16 option
- **Data anomalies**: Check if there are extreme values in the input data
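Gradient clipping from the first NaN remedy rescales gradients whose global L2 norm exceeds a threshold, which is what `max_grad_norm` controls. A minimal pure-Python sketch of the idea:

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# A (3, 4) gradient has norm 5; clipping to norm 1 scales it by 0.2.
print(clip_grad_norm([3.0, 4.0], max_norm=1.0))  # approximately [0.6, 0.8]
```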
### Q12: What is overfitting, and how can it be avoided?
**Answer**: Overfitting refers to a model performing well on training data but poorly on new data. Methods to avoid overfitting include:
1. **Increase the amount of data**: Use more training data or data augmentation techniques.
2. **Regularization methods**:
- Add dropout (usually set to 0.1-0.3)
- Use weight decay
- Early stopping: Stop training when the validation performance no longer improves
3. **Simplify the model**:
- Use fewer layers or smaller hidden dimensions
- Freeze some layers of the pre-trained model (using the freeze method)
4. **Cross-validation**: Use k-fold cross-validation to obtain a more robust model.
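The early-stopping rule in point 2 can be stated precisely: stop once the best validation loss has failed to improve for `patience` consecutive evaluations. A sketch (the function name is ours):

```python
def should_stop(val_losses, patience=3):
    """Return True when the minimum over the last `patience`
    validation losses is no better than the best loss seen before."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best

# Validation loss bottomed out at 0.6, then worsened for 3 evals:
print(should_stop([0.9, 0.7, 0.6, 0.61, 0.62, 0.63]))  # True
```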
## Evaluation Issues
### Q13: How do I interpret evaluation metrics? Which metric is the most important?
**Answer**: Different tasks focus on different metrics:
1. **Classification tasks**:
- **Accuracy**: The proportion of correct predictions, suitable for balanced datasets
- **F1 Score**: The harmonic mean of precision and recall, suitable for imbalanced datasets
- **MCC (Matthews Correlation Coefficient)**: A comprehensive measure of classification performance, more robust to class imbalance
- **AUROC (Area Under the ROC Curve)**: Measures the model's ability to distinguish between different classes
2. **Regression tasks**:
- **MSE (Mean Squared Error)**: The average of the squared differences between predicted and actual values; the smaller, the better
- **RMSE (Root Mean Squared Error)**: The square root of MSE, in the same units as the original data
- **MAE (Mean Absolute Error)**: The average of the absolute differences between predicted and actual values
- **R² (Coefficient of Determination)**: Measures the proportion of variance explained by the model, the closer to 1 the better
3. **Most important metric**: Depends on your specific application needs. For example, in drug screening, you may focus more on true positive rates; for structural prediction, you may focus more on RMSE.
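The regression metrics above are straightforward to compute by hand; this toy example uses made-up values purely to show how MSE, RMSE, MAE, and R² relate:

```python
import math

y_true = [3.0, 2.5, 4.0, 5.0]  # made-up targets
y_pred = [2.5, 3.0, 4.0, 4.5]  # made-up predictions
n = len(y_true)

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)                  # same units as the targets
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
mean = sum(y_true) / n                 # baseline for R²
r2 = 1 - mse * n / sum((t - mean) ** 2 for t in y_true)

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
# MSE=0.1875 RMSE=0.4330 MAE=0.3750 R2=0.7966
```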
### Q14: What should I do if the evaluation results are poor?
**Answer**: Common strategies to improve model performance:
1. **Data quality**:
- Check for errors or noise in the data
- Increase the number of training samples
- Ensure the training and test set distributions are similar
2. **Model adjustments**:
- Try different pre-trained models
- Adjust hyperparameters like learning rate and batch size
- Use different fine-tuning methods (full parameter fine-tuning, LoRA, etc.)
3. **Feature engineering**:
- Add structural information (e.g., using foldseek features)
- Consider sequence characteristics (e.g., hydrophobicity, charge, etc.)
4. **Ensemble methods**:
- Train multiple models and combine results
- Use cross-validation to obtain a more robust model
### Q15: Why does my model perform much worse on the test set than on the validation set?
**Answer**: Common reasons for decreased performance on the test set:
1. **Data distribution shift**:
- The training, validation, and test set distributions are inconsistent
- The test set contains protein families or features not seen during training
2. **Overfitting**:
- The model overfits the validation set because it was used for model selection
- Increasing regularization or reducing the number of training epochs may help
3. **Data leakage**:
- Unintentionally leaking test data information into the training process
- Ensure data splitting is done before preprocessing to avoid cross-contamination
4. **Randomness**:
- If the test set is small, results may be influenced by randomness
- Try training multiple models with different random seeds and averaging the results
## Prediction Issues
### Q16: How can I speed up the prediction process?
**Answer**: Methods to speed up predictions:
1. **Batch prediction**: Use batch prediction mode instead of single-sequence prediction, which can utilize the GPU more efficiently.
2. **Reduce computation**:
- Use a smaller model or a more efficient fine-tuning method
- Reduce the maximum sequence length (if possible)
3. **Hardware optimization**:
- Use a faster GPU or CPU
- Ensure predictions are done on the GPU rather than the CPU
4. **Model optimization**:
- Try model quantization (e.g., int8 quantization)
- Exporting to ONNX format may provide faster inference speeds
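Batch prediction from point 1 boils down to chunking the input so the model handles many sequences per forward pass (typically under `torch.no_grad()`). The chunking itself is just:

```python
def batched(sequences, batch_size):
    """Yield fixed-size chunks of the input list; the final chunk may
    be smaller than batch_size."""
    for i in range(0, len(sequences), batch_size):
        yield sequences[i:i + batch_size]

# Ten toy sequences split into batches of four:
seqs = [f"seq{i}" for i in range(10)]
print([len(b) for b in batched(seqs, 4)])  # [4, 4, 2]
```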
### Q17: What could be the reason for the prediction results being significantly different from expectations?
**Answer**: Possible reasons for prediction discrepancies:
1. **Data mismatch**:
- The sequences being predicted differ from the training data distribution
- There are significant differences in sequence length, composition, or structural features
2. **Model issues**:
- The model is under-trained or overfitted
- An unsuitable pre-trained model was chosen for the task
3. **Parameter configuration**:
- Ensure the parameters used during prediction (like maximum sequence length) are consistent with those used during training
- Check if the correct problem type (classification/regression) is being used
4. **Data preprocessing**:
- Ensure the prediction data undergoes the same preprocessing steps as the training data
- Check if the sequence format is correct (standard amino acid letters, no special characters)
### Q18: How can I batch predict a large number of sequences?
**Answer**: Steps for efficient batch prediction:
1. **Prepare the input file**:
- Create a CSV file containing all sequences
- The file must include a `sequence` column
- Optionally include an ID or other identifier columns
2. **Use the batch prediction feature**:
- Go to the prediction tab
- Select "Batch Prediction" mode
- Upload the sequence file
- Set an appropriate batch size (usually 16-32 is a good balance)
3. **Optimize settings**:
- Increasing the batch size can improve throughput (if memory allows)
- Reducing unnecessary feature calculations can speed up processing
4. **Result handling**:
- After prediction is complete, the system will generate a CSV file containing the original sequences and prediction results
- You can download this file for further analysis
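End to end, the batch workflow reads a CSV with a `sequence` column and writes one with a prediction column appended. A self-contained sketch, with a trivial stand-in function where the real model's batch predictions would go:

```python
import csv
import io

# A hypothetical input file with the required `sequence` column plus an id.
input_csv = "id,sequence\np1,MKTAYIAK\np2,GAVLIPFM\n"

rows = list(csv.DictReader(io.StringIO(input_csv)))

# Stand-in "model": replace with real batch predictions.
predictions = [len(r["sequence"]) % 2 for r in rows]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "sequence", "prediction"])
for r, p in zip(rows, predictions):
    writer.writerow([r["id"], r["sequence"], p])

print(out.getvalue())
```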
## Model and Result Issues
### Q19: Which pre-trained model should I choose?
**Answer**: Model selection recommendations:
1. **For general tasks**:
- ESM-2 is suitable for various protein-related tasks, balancing performance and efficiency
- ProtBERT performs well on certain sequence classification tasks
2. **Considerations**:
- **Data volume**: When data is limited, a smaller model may be better (to avoid overfitting)
- **Sequence length**: For long sequences, consider models that support longer contexts
- **Computational resources**: When resources are limited, choose smaller models or parameter-efficient methods
- **Task type**: Different models have their advantages in different tasks
3. **Recommended strategy**: If conditions allow, try several different models and choose the one that performs best on the validation set.
### Q20: How do I interpret the loss curve during training?
**Answer**: Guidelines for interpreting the loss curve:
1. **Ideal curve**:
- Both training loss and validation loss decrease steadily
- The two curves eventually stabilize and converge
- The validation loss stabilizes near its lowest point
2. **Common patterns and their meanings**:
- **Training loss continues to decrease while validation loss increases**: Signal of overfitting; consider increasing regularization
- **Both losses stagnate at high values**: Indicates underfitting; may need a more complex model or longer training
- **Curve fluctuates dramatically**: The learning rate may be too high; consider lowering it
- **Validation loss is lower than training loss**: This may indicate a data splitting issue or batch normalization effect
3. **Adjusting based on the curve**:
- If validation loss stops improving early, consider early stopping
- If training loss decreases very slowly, try increasing the learning rate
- If there are sudden jumps in the curve, check for data issues or learning rate scheduling
### Q21: How do I save and share my model?
**Answer**: Guidelines for saving and sharing models:
1. **Local saving**:
- After training is complete, the model will be automatically saved in the specified output directory
- The complete model includes model weights, configuration files, and tokenizer information
2. **Important files**:
- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `special_tokens_map.json` and `tokenizer_config.json`: Tokenizer configuration
3. **Sharing the model**:
- **Hugging Face Hub**: The easiest way is to upload to Hugging Face
- Create a model repository
- Upload your model files
- Add model descriptions and usage instructions in the readme
- **Local export**: You can also compress the model folder and share it
- Ensure all necessary files are included
- Provide environment requirements and usage instructions
4. **Documentation**: Regardless of the sharing method, you should provide:
- Description of the training data
- Model architecture and parameters
- Performance metrics
- Usage examples
## Interface and Operation Issues
### Q22: What should I do if the interface loads slowly or crashes?
**Answer**: Solutions for interface issues:
1. **Browser-related**:
- Try using different browsers (Chrome usually has the best compatibility)
- Clear browser cache and cookies
- Disable unnecessary browser extensions
2. **Resource issues**:
- Ensure the system has enough memory
- Close other resource-intensive programs
- If running on a remote server, check the server load
3. **Network issues**:
- Ensure the network connection is stable
- If using through an SSH tunnel, check if the connection is stable
4. **Restart services**:
- Try restarting the Gradio service
- In extreme cases, restart the server
### Q23: Why does my training stop responding midway?
**Answer**: Possible reasons why training stops responding, and how to address them:
1. **Resource exhaustion**:
- Insufficient system memory
- GPU memory overflow
- Solution: Reduce batch size, use more efficient training methods, or increase system resources
2. **Process termination**:
- The system's OOM (Out of Memory) killer terminated the process
- Server timeout policies may terminate long-running processes
- Solution: Check system logs, use tools like screen or tmux to run in the background, reduce resource usage
3. **Network or interface issues**:
- Browser crashes or network disconnections
- Solution: Run training via command line, or ensure a stable network connection
4. **Data or code issues**:
- Anomalies or incorrect formats in the dataset causing processing to hang
- Solution: Check the dataset, and test the process with a small subset of data |