Update README.md

The training data consists of 100,000 Python functions and their docstrings extracted from popular open-source repositories in the FLOSS ecosystem. Repositories were filtered based on metrics such as number of contributors (> 50), commits (> 5k), stars (> 35k), and forks (> 10k) to focus on well-established and actively maintained projects.
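As a rough illustration of these selection criteria, the sketch below filters repository metadata by the thresholds listed above. The metadata fields and the `is_well_established` helper are hypothetical; this is not the actual DocuMint sampling pipeline.

```python
# Hypothetical sketch of the repository-selection criteria described above.
# The metadata schema (dicts with "contributors", "commits", "stars", "forks")
# is an assumption for illustration only.
THRESHOLDS = {
    "contributors": 50,
    "commits": 5_000,
    "stars": 35_000,
    "forks": 10_000,
}

def is_well_established(repo: dict) -> bool:
    """Return True if a repository exceeds every activity/popularity threshold."""
    return all(repo.get(metric, 0) > minimum for metric, minimum in THRESHOLDS.items())

repos = [
    {"name": "big-project", "contributors": 120, "commits": 80_000, "stars": 60_000, "forks": 15_000},
    {"name": "small-project", "contributors": 3, "commits": 400, "stars": 900, "forks": 50},
]
print([r["name"] for r in repos if is_well_established(r)])  # ['big-project']
```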
An abstract syntax tree (AST) based parser was used to extract functions and docstrings. Challenges in the data sampling process included syntactic errors, multi-language repositories, computational expense, repository size discrepancies, and ensuring diversity while avoiding repetition.
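The extraction step can be approximated with Python's built-in `ast` module, as in the minimal sketch below. This is an illustrative reconstruction rather than the project's actual parser; it simply skips sources that fail to parse, which is one of the syntactic-error cases noted above.

```python
import ast

def extract_functions_with_docstrings(source: str):
    """Yield (function_name, docstring) pairs for documented functions in a source file."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return  # skip files that do not parse
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if docstring:
                yield node.name, docstring

example = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''
print(list(extract_functions_with_docstrings(example)))  # [('add', 'Return the sum of a and b.')]
```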
#### Training Hyperparameters

| Hyperparameter              | Value       |
|-----------------------------|-------------|
| Fine-tuning Method          | LoRA        |
| Epochs                      | 4           |
| Batch Size                  | 8           |
| Gradient Accumulation Steps | 16          |
| Initial Learning Rate       | 2e-4        |
| LoRA Parameters             | 78,446,592  |
| Training Tokens             | 185,040,896 |
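For orientation, these settings might be expressed with Hugging Face `peft` and `transformers` roughly as in the sketch below. The base checkpoint, LoRA rank/alpha/dropout, and output path are placeholders that are not stated in this card; only the commented values come from the table.

```python
# Hedged sketch of a LoRA fine-tuning setup matching the table above.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base_model = AutoModelForCausalLM.from_pretrained("base-model-name")  # placeholder checkpoint

lora_config = LoraConfig(
    r=16,               # assumption: rank is not reported in this card
    lora_alpha=32,      # assumption
    lora_dropout=0.05,  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # the card reports 78,446,592 trainable LoRA parameters

training_args = TrainingArguments(
    output_dir="documint-lora",       # placeholder path
    num_train_epochs=4,               # from the table
    per_device_train_batch_size=8,    # from the table
    gradient_accumulation_steps=16,   # from the table
    learning_rate=2e-4,               # from the table (initial learning rate)
)
```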
Fine-tuning was performed on a machine with an Intel 12900K CPU, an Nvidia RTX 3090 GPU, and 64 GB of RAM. Total fine-tuning time was 48 GPU hours.
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Metrics
- **Accuracy:** Measures how well the generated docstring covers code elements such as input/output variables, calculated as the cosine similarity between the embeddings of the generated and expert docstrings (see the sketch below).
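A minimal sketch of the cosine-similarity computation behind this metric is shown below. The embedding vectors are placeholders, since the card does not name the encoder used to embed the generated and expert docstrings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings standing in for the generated and expert docstrings.
generated_embedding = np.array([0.12, 0.53, 0.31, 0.80])
expert_embedding = np.array([0.10, 0.50, 0.35, 0.78])

print(f"accuracy (cosine similarity): {cosine_similarity(generated_embedding, expert_embedding):.3f}")
```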