# ProFactory Frequently Asked Questions (FAQ)

## Installation and Environment Configuration Issues

### Q1: How do I install ProFactory?

**Answer**: Follow the installation steps in the README.md at the root of the repository.

### Q2: What should I do if I encounter the error "Could not find a specific dependency" during installation?

**Answer**: There are several solutions for this situation:

1. Try installing the problematic dependency individually:
   ```bash
   pip install name_of_the_problematic_library
   ```

2. If it is a CUDA-related library, ensure you have installed a PyTorch version compatible with your CUDA version:
   ```bash
   # For example, for CUDA 11.7
   pip install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
   ```

3. For some special libraries, you may need to install system dependencies first. For example, on Ubuntu:
   ```bash
   sudo apt-get update
   sudo apt-get install build-essential
   ```

### Q3: How can I check if my CUDA is installed correctly?

**Answer**: You can verify that CUDA is installed correctly in the following ways:

1. Check the CUDA version:
   ```bash
   nvidia-smi
   ```

2. Verify if PyTorch can recognize CUDA in Python:
   ```python
   import torch

   print(torch.cuda.is_available())      # Should return True
   print(torch.cuda.device_count())      # Displays the number of GPUs
   print(torch.cuda.get_device_name(0))  # Displays the GPU name
   ```

3. If PyTorch cannot recognize CUDA, ensure you have installed the matching versions of PyTorch and CUDA.

## Hardware and Resource Issues

### Q4: What should I do if I encounter a "CUDA out of memory" error during runtime?

**Answer**: This error indicates that your GPU memory is insufficient. Solutions include:

1. **Reduce the batch size**: This is the most direct and effective method. Reduce the batch size in the training configuration by half or more.

2. **Use a smaller model**: Choose a pre-trained model with fewer parameters, such as switching from a large checkpoint (e.g., the 650M-parameter ESM-2) to a smaller variant (e.g., the 35M-parameter ESM-2).

3. **Enable gradient accumulation**: Increase the `gradient_accumulation_steps` parameter (for example, to 2 or 4) while lowering the per-device batch size; this reduces memory usage while keeping the effective batch size the same.

4. **Use mixed precision training**: Enable the `fp16` option in the training options, which can significantly reduce memory usage.

5. **Reduce the maximum sequence length**: If your data allows, you can decrease the `max_seq_length` parameter.
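
For orientation, here is how options 1, 3, and 4 fit together in a Hugging Face-style training configuration. This is a hedged sketch with illustrative values; ProFactory's own option names may differ:

```python
from transformers import TrainingArguments

# Illustrative values only (fp16 assumes a CUDA GPU): a smaller per-device
# batch plus gradient accumulation keeps the effective batch size at 4 * 4 = 16.
args = TrainingArguments(
    output_dir="ckpt",
    per_device_train_batch_size=4,   # halved (or more) from the value that ran out of memory
    gradient_accumulation_steps=4,   # effective batch size: 4 * 4 = 16
    fp16=True,                       # mixed precision roughly halves activation memory
)
```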

### Q5: How can I determine what batch size I should use?

**Answer**: Determining the appropriate batch size requires balancing memory usage and training effectiveness:

1. **Start small and gradually increase**: Begin with smaller values (like 4 or 8) and gradually increase until memory is close to its limit.

2. **Refer to benchmarks**: For common protein models, most studies use a batch size of 16-64, but this depends on your GPU memory and sequence length.

3. **Monitor the training process**: A larger batch size may make each training iteration more stable but may require a higher learning rate.

4. **Rule of thumb for memory issues**: If you encounter memory errors, first try halving the batch size.
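
If you prefer to find the limit empirically, a small probing loop works. This sketch assumes you wrap one training step in a `run_step(batch_size)` callable of your own; both names are placeholders, not ProFactory APIs:

```python
import torch

def find_max_batch_size(run_step, start=4, limit=512):
    """Double the batch size until CUDA runs out of memory, then back off.

    `run_step(n)` is a placeholder for your own code: it should run one
    forward + backward pass with a batch of n examples on the GPU.
    """
    n = start
    while n <= limit:
        try:
            run_step(n)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release the failed allocation
            break
        n *= 2
    return n // 2  # the last size that fit
```

Leave some headroom below the returned value: variation in sequence length and optimizer state can push peak memory past what a single probed step shows.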

## Dataset Issues

### Q6: How do I prepare a custom dataset?

**Answer**: Preparing a custom dataset requires the following steps:

1. **Format the data**: The data should be organized into a CSV file, containing at least the following columns:
   - `sequence`: The protein sequence, represented using standard amino acid letters
   - Label column: Depending on your task type, this can be numerical (regression) or categorical (classification)

2. **Split the data**: Prepare training, validation, and test sets, such as `train.csv`, `validation.csv`, and `test.csv`.

3. **Upload to Hugging Face**:
   - Create a dataset repository on Hugging Face
   - Upload your CSV file
   - Reference it in ProFactory using the `username/dataset_name` format

4. **Create dataset configuration**: The configuration should include the problem type (regression or classification), number of labels, and evaluation metrics.
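
A minimal sketch of the formatting and splitting steps with pandas; the input file name and the 80/10/10 split are illustrative choices:

```python
import pandas as pd

df = pd.read_csv("my_proteins.csv")        # expects `sequence` and a label column
df = df.sample(frac=1.0, random_state=42)  # shuffle before splitting

n = len(df)
df.iloc[: int(0.8 * n)].to_csv("train.csv", index=False)
df.iloc[int(0.8 * n) : int(0.9 * n)].to_csv("validation.csv", index=False)
df.iloc[int(0.9 * n) :].to_csv("test.csv", index=False)
```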

### Q7: What should I do if I encounter a format error when importing my dataset?

**Answer**: Common format issues and their solutions:

1. **Incorrect column names**: Ensure the CSV file contains the necessary columns, especially the `sequence` column and label column.

2. **Sequence format issues**:
   - Ensure the sequence contains only valid amino acid letters (ACDEFGHIKLMNPQRSTVWY)
   - Remove spaces, line breaks, or other illegal characters from the sequence
   - Check if the sequence length is within a reasonable range

3. **Encoding issues**: Ensure the CSV file is saved with UTF-8 encoding.

4. **CSV delimiter issues**: Ensure the file uses the correct delimiter (usually a comma). You can use a text editor to view and correct it.

5. **Handling missing values**: Ensure there are no missing values in the data, or handle them appropriately.
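
A quick validation pass along these lines catches most of the issues above before you upload (the file name is an example):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # example file name
assert "sequence" in df.columns, "missing the required `sequence` column"
assert not df["sequence"].isna().any(), "dataset contains missing sequences"

# Flag rows containing anything outside the 20 standard amino acid letters.
cleaned = df["sequence"].str.strip().str.upper()
bad = df[~cleaned.str.fullmatch(r"[ACDEFGHIKLMNPQRSTVWY]+")]
print(f"{len(bad)} rows contain invalid characters")
```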

### Q8: My dataset is large, and the system loads slowly or crashes. What should I do?

**Answer**: For large datasets, you can:

1. **Reduce the dataset size**: If possible, test your method with a subset of the data first.

2. **Increase data loading efficiency**:
   - Use the `batch_size` parameter to control the amount of data loaded at a time
   - Enable data caching to avoid repeated loading
   - Preprocess the data to reduce file size (e.g., remove unnecessary columns)

3. **Dataset sharding**: Split large datasets into multiple smaller files and process them one by one.

4. **Increase system resources**: If possible, increase RAM or use a server with more memory.
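
Points 2 and 3 combine naturally with pandas' chunked reader; a sketch with example file and column names:

```python
import pandas as pd

# Stream the file in 10,000-row chunks instead of loading it all at once,
# keep only the needed columns, and write one shard per chunk.
for i, chunk in enumerate(pd.read_csv("big_dataset.csv", chunksize=10_000)):
    chunk[["sequence", "label"]].to_csv(f"shard_{i:03d}.csv", index=False)
```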

## Training Issues

### Q9: How can I recover if the training suddenly interrupts?

**Answer**: Methods to handle training interruptions:

1. **Check checkpoints**: The system periodically saves checkpoints (usually in the `ckpt` directory). You can recover from the most recent checkpoint:
   - Look for the last saved model file (usually named `checkpoint-X`, where X is the step number)
   - Specify the checkpoint path as the starting point in the training options

2. **Use the checkpoint recovery feature**: Enable the checkpoint recovery option in the training configuration.

3. **Save checkpoints more frequently**: Adjust the frequency of saving checkpoints, for example, save every 500 steps instead of the default every 1000 steps.
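
A small helper can locate the most recent checkpoint automatically, assuming the `checkpoint-X` naming convention described above:

```python
import os
import re

def latest_checkpoint(ckpt_dir="ckpt"):
    """Return the path of the highest-numbered `checkpoint-X` directory, or None."""
    if not os.path.isdir(ckpt_dir):
        return None
    steps = [
        (int(m.group(1)), name)
        for name in os.listdir(ckpt_dir)
        if (m := re.fullmatch(r"checkpoint-(\d+)", name))
    ]
    return os.path.join(ckpt_dir, max(steps)[1]) if steps else None

print(latest_checkpoint())  # pass the result as the training start point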

### Q10: How can I speed up training if it is very slow?

**Answer**: Methods to speed up training:

1. **Hardware aspects**:
   - Use a more powerful GPU
   - Use multi-GPU training (if supported)
   - Ensure data is stored on an SSD rather than an HDD

2. **Parameter settings**:
   - Use mixed precision training (enable the fp16 option)
   - Increase the batch size (if memory allows)
   - Reduce the maximum sequence length (if the task allows)
   - Decrease validation frequency (the `eval_steps` parameter)

3. **Model selection**:
   - Choose a smaller pre-trained model
   - Use parameter-efficient fine-tuning methods (like LoRA)
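
As an illustration of the last point, parameter-efficient fine-tuning with LoRA via the `peft` library looks roughly like this; the base model and target modules are assumptions for the sketch, not ProFactory defaults:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/esm2_t12_35M_UR50D", num_labels=2
)
config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,                                 # low-rank update dimension
    lora_alpha=16,
    target_modules=["query", "value"],   # attention projections in ESM-2
)
model = get_peft_model(model, config)
model.print_trainable_parameters()       # only a small fraction remains trainable
```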

### Q11: What does it mean if the loss value does not decrease or if NaN values appear during training?

**Answer**: This usually indicates a problem with the training setup:

1. **Reasons for loss not decreasing and solutions**:
   - **Learning rate too high**: Try reducing the learning rate, for example, from 5e-5 to 1e-5
   - **Optimizer issues**: Try different optimizers, such as switching from Adam to AdamW
   - **Initialization issues**: Check the model initialization settings
   - **Data issues**: Validate if the training data has outliers or label errors

2. **Reasons for NaN values and solutions**:
   - **Gradient explosion**: Add gradient clipping, set the `max_grad_norm` parameter
   - **Learning rate too high**: Significantly reduce the learning rate
   - **Numerical instability**: This may occur when using mixed precision training; try disabling the fp16 option
   - **Data anomalies**: Check if there are extreme values in the input data
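
For reference, this is what gradient clipping looks like in a bare PyTorch loop (toy model and data; in Trainer-style setups the same control is usually exposed as `max_grad_norm`):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
x, y = torch.randn(4, 8), torch.randn(4, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale gradients so their global norm is at most 1.0 before stepping.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```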

### Q12: What is overfitting, and how can it be avoided?

**Answer**: Overfitting refers to a model performing well on training data but poorly on new data. Methods to avoid overfitting include:

1. **Increase the amount of data**: Use more training data or data augmentation techniques.

2. **Regularization methods**:
   - Add dropout (usually set to 0.1-0.3)
   - Use weight decay
   - Early stopping: Stop training when the validation performance no longer improves

3. **Simplify the model**:
   - Use fewer layers or smaller hidden dimensions
   - Freeze some layers of the pre-trained model (using the freeze method)

4. **Cross-validation**: Use k-fold cross-validation to obtain a more robust model.
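
Early stopping (mentioned under regularization above) can be sketched in a few lines; `evaluate` and `save_best` are placeholders for your own validation and checkpointing code:

```python
def train_with_early_stopping(evaluate, save_best, max_epochs=100, patience=3):
    """Stop once validation loss has not improved for `patience` epochs."""
    best_loss, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        val_loss = evaluate()          # placeholder: run one epoch + validation
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            save_best()                # placeholder: keep the best model so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_loss
```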

## Evaluation Issues

### Q13: How do I interpret evaluation metrics? Which metric is the most important?

**Answer**: Different tasks focus on different metrics:

1. **Classification tasks**:
   - **Accuracy**: The proportion of correct predictions, suitable for balanced datasets
   - **F1 Score**: The harmonic mean of precision and recall, suitable for imbalanced datasets
   - **MCC (Matthews Correlation Coefficient)**: A comprehensive measure of classification performance, more robust to class imbalance
   - **AUROC (Area Under the ROC Curve)**: Measures the model's ability to distinguish between different classes

2. **Regression tasks**:
   - **MSE (Mean Squared Error)**: The average of the squared differences between predicted and actual values; smaller is better
   - **RMSE (Root Mean Squared Error)**: The square root of MSE, in the same units as the original data
   - **MAE (Mean Absolute Error)**: The average of the absolute differences between predicted and actual values
   - **R² (Coefficient of Determination)**: The proportion of variance explained by the model; the closer to 1, the better

3. **Most important metric**: Depends on your specific application needs. For example, in drug screening, you may focus more on true positive rates; for structural prediction, you may focus more on RMSE.
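
For intuition, the classification metrics above can be computed directly with scikit-learn on toy labels:

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # toy ground-truth labels
y_prob = [0.2, 0.6, 0.8, 0.4, 0.9, 0.1]     # model's positive-class probabilities
y_pred = [int(p >= 0.5) for p in y_prob]    # hard predictions at a 0.5 threshold

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_prob))
```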

### Q14: What should I do if the evaluation results are poor?

**Answer**: Common strategies to improve model performance:

1. **Data quality**:
   - Check for errors or noise in the data
   - Increase the number of training samples
   - Ensure the training and test set distributions are similar

2. **Model adjustments**:
   - Try different pre-trained models
   - Adjust hyperparameters like learning rate and batch size
   - Use different fine-tuning methods (full parameter fine-tuning, LoRA, etc.)

3. **Feature engineering**:
   - Add structural information (e.g., using foldseek features)
   - Consider sequence characteristics (e.g., hydrophobicity, charge, etc.)

4. **Ensemble methods**:
   - Train multiple models and combine results
   - Use cross-validation to obtain a more robust model

### Q15: Why does my model perform much worse on the test set than on the validation set?

**Answer**: Common reasons for decreased performance on the test set:

1. **Data distribution shift**:
   - The training, validation, and test set distributions are inconsistent
   - The test set contains protein families or features not seen during training

2. **Overfitting**:
   - The model overfits the validation set because it was used for model selection
   - Increasing regularization or reducing the number of training epochs may help

3. **Data leakage**:
   - Unintentionally leaking test data information into the training process
   - Ensure data splitting is done before preprocessing to avoid cross-contamination

4. **Randomness**:
   - If the test set is small, results may be influenced by randomness
   - Try training multiple models with different random seeds and averaging the results

## Prediction Issues

### Q16: How can I speed up the prediction process?

**Answer**: Methods to speed up predictions:

1. **Batch prediction**: Use batch prediction mode instead of single-sequence prediction, which can utilize the GPU more efficiently.

2. **Reduce computation**:
   - Use a smaller model or a more efficient fine-tuning method
   - Reduce the maximum sequence length (if possible)

3. **Hardware optimization**:
   - Use a faster GPU or CPU
   - Ensure predictions are done on the GPU rather than the CPU

4. **Model optimization**:
   - Try model quantization (e.g., int8 quantization)
   - Exporting to ONNX format may provide faster inference speeds
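
A sketch of point 1 with a toy model: batching under `torch.no_grad()`, with data moved to the GPU when one is available, is usually the cheapest speedup:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 2).to(device).eval()   # stand-in for your trained model
data = torch.randn(1000, 8)                       # stand-in for encoded sequences

preds = []
with torch.no_grad():                             # skip autograd bookkeeping at inference
    for batch in data.split(64):                  # 64 sequences at a time, not one by one
        preds.append(model(batch.to(device)).cpu())
preds = torch.cat(preds)
```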

### Q17: What could be the reason for the prediction results being significantly different from expectations?

**Answer**: Possible reasons for prediction discrepancies:

1. **Data mismatch**:
   - The sequences being predicted differ from the training data distribution
   - There are significant differences in sequence length, composition, or structural features

2. **Model issues**:
   - The model is under-trained or overfitted
   - An unsuitable pre-trained model was chosen for the task

3. **Parameter configuration**:
   - Ensure the parameters used during prediction (like maximum sequence length) are consistent with those used during training
   - Check if the correct problem type (classification/regression) is being used

4. **Data preprocessing**:
   - Ensure the prediction data undergoes the same preprocessing steps as the training data
   - Check if the sequence format is correct (standard amino acid letters, no special characters)

### Q18: How can I batch predict a large number of sequences?

**Answer**: Steps for efficient batch prediction:

1. **Prepare the input file**:
   - Create a CSV file containing all sequences
   - The file must include a `sequence` column
   - Optionally include an ID or other identifier columns

2. **Use the batch prediction feature**:
   - Go to the prediction tab
   - Select "Batch Prediction" mode
   - Upload the sequence file
   - Set an appropriate batch size (usually 16-32 is a good balance)

3. **Optimize settings**:
   - Increasing the batch size can improve throughput (if memory allows)
   - Reducing unnecessary feature calculations can speed up processing

4. **Result handling**:
   - After prediction is complete, the system will generate a CSV file containing the original sequences and prediction results
   - You can download this file for further analysis
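
A sketch of step 1, building a conforming input file with pandas (the IDs and sequences are made-up examples):

```python
import pandas as pd

records = [
    {"id": "seq_001", "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"},
    {"id": "seq_002", "sequence": "MSILVTRPSPAGEELVSRLRTLGQVAWHFPLIE"},
]
pd.DataFrame(records).to_csv("predict_input.csv", index=False)  # required column: sequence
```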

## Model and Result Issues

### Q19: Which pre-trained model should I choose?

**Answer**: Model selection recommendations:

1. **For general tasks**:
   - ESM-2 is suitable for various protein-related tasks, balancing performance and efficiency
   - ProtBERT performs well on certain sequence classification tasks

2. **Considerations**:
   - **Data volume**: When data is limited, a smaller model may be better (to avoid overfitting)
   - **Sequence length**: For long sequences, consider models that support longer contexts
   - **Computational resources**: When resources are limited, choose smaller models or parameter-efficient methods
   - **Task type**: Different models have their advantages in different tasks

3. **Recommended strategy**: If conditions allow, try several different models and choose the one that performs best on the validation set.

### Q20: How do I interpret the loss curve during training?

**Answer**: Guidelines for interpreting the loss curve:

1. **Ideal curve**:
   - Both training loss and validation loss decrease steadily
   - The two curves eventually stabilize and converge
   - The validation loss stabilizes near its lowest point

2. **Common patterns and their meanings**:
   - **Training loss continues to decrease while validation loss increases**: Signal of overfitting; consider increasing regularization
   - **Both losses stagnate at high values**: Indicates underfitting; may need a more complex model or longer training
   - **Curve fluctuates dramatically**: The learning rate may be too high; consider lowering it
   - **Validation loss is lower than training loss**: Often explained by regularization (e.g., dropout) being active only during training; it can also indicate a data splitting issue

3. **Adjusting based on the curve**:
   - If validation loss stops improving early, consider early stopping
   - If training loss decreases very slowly, try increasing the learning rate
   - If there are sudden jumps in the curve, check for data issues or learning rate scheduling

### Q21: How do I save and share my model?

**Answer**: Guidelines for saving and sharing models:

1. **Local saving**:
   - After training is complete, the model will be automatically saved in the specified output directory
   - The complete model includes model weights, configuration files, and tokenizer information

2. **Important files**:
   - `pytorch_model.bin`: Model weights
   - `config.json`: Model configuration
   - `special_tokens_map.json` and `tokenizer_config.json`: Tokenizer configuration

3. **Sharing the model**:
   - **Hugging Face Hub**: The easiest way is to upload to Hugging Face (a short upload sketch follows this answer):
     - Create a model repository
     - Upload your model files
     - Add a model description and usage instructions in the README
   
   - **Local export**: You can also compress the model folder and share it
     - Ensure all necessary files are included
     - Provide environment requirements and usage instructions

4. **Documentation**: Regardless of the sharing method, you should provide:
   - Description of the training data
   - Model architecture and parameters
   - Performance metrics
   - Usage examples
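
A hedged sketch of the Hugging Face upload path using the `huggingface_hub` client (the repository name and local folder are placeholders; log in first with `huggingface-cli login`):

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/my-protein-model", exist_ok=True)  # placeholder repo id
api.upload_folder(
    repo_id="your-username/my-protein-model",
    folder_path="ckpt/final_model",   # placeholder: your trained model directory
)
```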

## Interface and Operation Issues

### Q22: What should I do if the interface loads slowly or crashes?

**Answer**: Solutions for interface issues:

1. **Browser-related**:
   - Try using different browsers (Chrome usually has the best compatibility)
   - Clear browser cache and cookies
   - Disable unnecessary browser extensions

2. **Resource issues**:
   - Ensure the system has enough memory
   - Close other resource-intensive programs
   - If running on a remote server, check the server load

3. **Network issues**:
   - Ensure the network connection is stable
   - If using through an SSH tunnel, check if the connection is stable

4. **Restart services**:
   - Try restarting the Gradio service
   - In extreme cases, restart the server

### Q23: Why does my training stop responding midway?

**Answer**: Possible reasons why training stops responding, along with solutions:

1. **Resource exhaustion**:
   - Insufficient system memory
   - GPU memory overflow
   - Solution: Reduce batch size, use more efficient training methods, or increase system resources

2. **Process termination**:
   - The system's OOM (Out of Memory) killer terminated the process
   - Server timeout policies may terminate long-running processes
   - Solution: Check system logs, use tools like screen or tmux to run in the background, reduce resource usage

3. **Network or interface issues**:
   - Browser crashes or network disconnections
   - Solution: Run training via command line, or ensure a stable network connection

4. **Data or code issues**:
   - Anomalies or incorrect formats in the dataset causing processing to hang
   - Solution: Check the dataset, and test the process with a small subset of data