---
library_name: transformers
tags:
- code
- cybersecurity
- vulnerability
- cpp
license: apache-2.0
datasets:
- lemon42-ai/minified-diverseful-multilabels
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
---

# Model Card for ThreatDetect-C-Cpp

<!-- ![deck](deck.png){: width="200px"} -->
<img src="linkedin-deck.png" width="800">

This model is a fine-tuned derivative of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base). <br>
We fine-tuned ModernBERT-base to detect vulnerabilities in C/C++ code. <br>
The current version reaches an accuracy of 86% on the evaluation set. <br>

## Model Details

### Model Description

ThreatDetect-C-Cpp can be used as a code classifier. <br>
Instead of binary classification ("safe", "unsafe"), the model classifies the input code into 7 labels: 'safe' (no vulnerability detected) and six CWE weaknesses:

| Label   | Description                                                             |
|---------|-------------------------------------------------------------------------|
| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer |
| CWE-125 | Out-of-bounds Read                                                      |
| CWE-20  | Improper Input Validation                                               |
| CWE-416 | Use After Free                                                          |
| CWE-703 | Improper Check or Handling of Exceptional Conditions                    |
| CWE-787 | Out-of-bounds Write                                                     |
| safe    | Safe code                                                               |
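
The label set above can be mirrored as a small lookup table when post-processing predictions. The sketch below is illustrative only: the helper names `describe` and `is_vulnerable` are hypothetical, and it assumes the model emits label strings spelled exactly as in the table, which this card does not state.

```python
# Label descriptions copied from the table above.
LABEL_DESCRIPTIONS = {
    "CWE-119": "Improper Restriction of Operations within the Bounds of a Memory Buffer",
    "CWE-125": "Out-of-bounds Read",
    "CWE-20": "Improper Input Validation",
    "CWE-416": "Use After Free",
    "CWE-703": "Improper Check or Handling of Exceptional Conditions",
    "CWE-787": "Out-of-bounds Write",
    "safe": "Safe code",
}


def describe(label: str) -> str:
    """Return the description of a predicted label (hypothetical helper)."""
    try:
        return LABEL_DESCRIPTIONS[label]
    except KeyError:
        # The model only knows the seven labels listed in the table.
        raise ValueError(f"unknown label: {label!r}") from None


def is_vulnerable(label: str) -> bool:
    """True for any of the six CWE labels, False for 'safe'."""
    describe(label)  # validates the label first
    return label != "safe"
```

Rejecting unknown labels explicitly makes it harder to misread a prediction as covering a weakness outside the six CWEs listed here.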


- **Developed by:** [lemon42-ai](https://github.com/lemon42-ai)
- **Contributors:** [Abdellah Oumida](https://www.linkedin.com/in/abdellah-oumida-ab9082234/) & [Mohammed Sbaihi](https://www.linkedin.com/in/mohammed-sbaihi-aa6493254/)
- **Model type:** [ModernBERT, Encoder-only Transformer](https://arxiv.org/abs/2412.13663)
- **Supported Programming Languages:** C/C++ 
- **License:** Apache 2.0 (see the original ModernBERT-base license)
- **Finetuned from model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

### Model Sources


- **Repository:** [The official lemon42-ai GitHub repository](https://github.com/lemon42-ai/ThreatDetect-code-vulnerability-detection)
- **Technical Blog Post:** Coming soon.

## Uses

ThreatDetect-C-Cpp can be integrated into code-related applications. For example, it can be paired with a code generator to detect vulnerabilities in the generated code.
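
As a rough illustration of that pairing, the sketch below loads the classifier through the `transformers` text-classification pipeline and filters a batch of generated snippets. Several things here are assumptions rather than facts from this card: the repository id `lemon42-ai/ThreatDetect-C-Cpp`, the idea that predictions come back with labels spelled as in the label table, and the choice to truncate inputs to 600 tokens to mirror the training sequence length. `load_classifier` and `flag_unsafe` are hypothetical helper names.

```python
from typing import Callable, Dict, List

# ASSUMPTION: the Hub repository id is not given in this card.
MODEL_ID = "lemon42-ai/ThreatDetect-C-Cpp"


def load_classifier(model_id: str = MODEL_ID) -> Callable[[str], Dict]:
    """Build a classify(code) function backed by a transformers pipeline."""
    # Imported lazily so the filtering logic below works without transformers.
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id)

    def classify(code: str) -> Dict:
        # Truncate long inputs; 600 mirrors the training max sequence length.
        return clf(code, truncation=True, max_length=600)[0]

    return classify


def flag_unsafe(snippets: List[str], classify: Callable[[str], Dict]) -> List[Dict]:
    """Return the snippets whose predicted label is anything other than 'safe'."""
    flagged = []
    for snippet in snippets:
        prediction = classify(snippet)  # e.g. {"label": ..., "score": ...}
        if prediction["label"] != "safe":
            flagged.append({"code": snippet, **prediction})
    return flagged
```

With the model downloaded, `flag_unsafe(generated_snippets, load_classifier())` would return only the snippets the model considers vulnerable, together with the predicted label and score.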



## Bias, Risks, and Limitations

ThreatDetect-C-Cpp can detect weaknesses in C/C++ code only. It should not be used with other programming languages.<br>
The model can only detect the six CWEs listed in the table above.



## Training Details

### Training Data

The model was fine-tuned on a minified, cleaned, and deduplicated version of the [DiverseVul](https://github.com/wagner-group/diversevul) dataset. <br>
This version can be explored on [Hugging Face Datasets](https://huggingface.co/datasets/lemon42-ai/minified-diverseful-multilabels).

### Training Procedure

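The model was fine-tuned with LoRA applied to the attention query and value projections (the target module is listed as `attn.Wqkv` in the hyperparameter table below).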
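
A minimal sketch of how the LoRA values from the hyperparameter table could be expressed with the `peft` library is shown below. It is not the authors' training script: `build_lora_config` and the two arithmetic helpers are hypothetical names, the `SEQ_CLS` task type is an assumption, and the 768 to 2304 shape used in the usage note assumes ModernBERT-base's hidden size of 768 with a fused query/key/value projection.

```python
# Values copied from the hyperparameter table in this card.
LORA_SETTINGS = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "target_modules": ["attn.Wqkv"],
}


def lora_scaling(settings: dict) -> float:
    """Standard LoRA scaling factor applied to the low-rank update: alpha / r."""
    return settings["lora_alpha"] / settings["r"]


def lora_params_per_linear(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


def build_lora_config():
    """Build a peft LoraConfig (ASSUMPTION: sequence-classification task type)."""
    # Imported lazily so the arithmetic helpers work without peft installed.
    from peft import LoraConfig, TaskType

    return LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_SETTINGS)
```

Under these assumptions, `lora_scaling(LORA_SETTINGS)` gives 4.0 and `lora_params_per_linear(768, 2304, 8)` gives 24,576 extra trainable parameters per adapted projection, which is why LoRA fine-tuning is cheap compared with updating the full weight matrix.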



#### Training Hyperparameters

| Hyperparameter          | Value                      |
|-------------------------|---------------------------|
| Max Sequence Length    | 600                         |
| Batch Size            | 32                          |
| Number of Epochs       | 9                          |
| Learning Rate         | 5e-4                        |
| Weight Decay          | 0.01                        |
| Logging Steps         | 100                         |
| LoRA Rank (r)         | 8                           |
| LoRA Alpha            | 32                          |
| LoRA Dropout          | 0.1                         |
| LoRA Target Modules   | attn.Wqkv                   |
| Optimizer             | AdamW                       |
| LR Scheduler          | CosineAnnealingWarmRestarts |
| Scheduler T_0         | 10                          |
| Scheduler T_mult      | 2                           |
| Scheduler eta_min     | 1e-6                        |
| Training Split Ratio  | 90% Train / 10% Validation  |
| Seed for Splitting    | 42                          |
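
To make the scheduler rows concrete, the following sketch reimplements the cosine-annealing-with-warm-restarts rule in plain Python using the table's values (learning rate 5e-4, `T_0` 10, `T_mult` 2, `eta_min` 1e-6). It is meant to mirror the behaviour of PyTorch's `CosineAnnealingWarmRestarts` for integer steps; the function name is hypothetical, and the card does not say whether the scheduler was stepped per epoch or per batch.

```python
import math


def cosine_warm_restart_lr(step: int,
                           base_lr: float = 5e-4,
                           eta_min: float = 1e-6,
                           t_0: int = 10,
                           t_mult: int = 2) -> float:
    """Learning rate at a given scheduler step (hypothetical helper).

    Cycle lengths are t_0, t_0 * t_mult, t_0 * t_mult**2, ...
    Within a cycle the rate decays from base_lr towards eta_min on a cosine curve.
    """
    t_i = t_0        # length of the current cycle
    t_cur = step     # position inside the current cycle
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


if __name__ == "__main__":
    for step in (0, 5, 9, 10, 29, 30):
        print(step, f"{cosine_warm_restart_lr(step):.3e}")
```

With these values the rate starts at 5e-4, decays over the first 10 steps, and jumps back to 5e-4 at steps 10 and 30. If the scheduler was stepped once per epoch, the 9 training epochs would fit inside the first cycle and no restart would occur.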



## Evaluation

ThreatDetect-C-Cpp reaches an accuracy of 86% on the evaluation set.



## Technical Specifications 


### Hardware

The model was fine-tuned on 4 Tesla V100 GPUs for 1 hour using PyTorch and Accelerate.