---
library_name: transformers
tags:
- code
- cybersecurity
- vulnerability
- cpp
license: apache-2.0
datasets:
- lemon42-ai/minified-diverseful-multilabels
metrics:
- accuracy
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
---
# Model Card for ThreatDetect-C-Cpp
<!-- {: width="200px"} -->
<img src="linkedin-deck.png" width="800">
This model is derived from [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base). <br>
We fine-tuned ModernBERT-base to detect vulnerabilities in C/C++ code. <br>
The current version reaches an accuracy of 86% on the evaluation set. <br>
## Model Details
### Model Description
ThreatDetect-C-Cpp can be used as a code classifier. <br>
Instead of binary classification ("safe" vs. "unsafe"), the model assigns the input code one of seven labels: 'safe' (no vulnerability detected) or one of six CWE weaknesses:
| Label | Description |
|---------|-------------------------------------------------------|
| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer |
| CWE-125 | Out-of-bounds Read |
| CWE-20 | Improper Input Validation |
| CWE-416 | Use After Free |
| CWE-703 | Improper Check or Handling of Exceptional Conditions |
| CWE-787 | Out-of-bounds Write |
| safe | Safe code |
- **Developed by:** [lemon42-ai](https://github.com/lemon42-ai)
- **Contributors:** [Abdellah Oumida](https://www.linkedin.com/in/abdellah-oumida-ab9082234/) & [Mohammed Sbaihi](https://www.linkedin.com/in/mohammed-sbaihi-aa6493254/)
- **Model type:** [ModernBERT, Encoder-only Transformer](https://arxiv.org/abs/2412.13663)
- **Supported Programming Languages:** C/C++
- **License:** Apache 2.0 (see original License of ModernBERT-Base)
- **Finetuned from model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
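
The following is a minimal inference sketch, not the authors' official usage code. It assumes the Hub id `lemon42-ai/ThreatDetect-C-Cpp` (inferred from the card title), that the repository hosts a full sequence-classification checkpoint, and that the checkpoint's `id2label` config carries the label names from the table above. The helper names `softmax`, `decode`, and `classify` are illustrative.

```python
import math

# Assumption: Hub id inferred from the model card title; adjust if it differs.
MODEL_ID = "lemon42-ai/ThreatDetect-C-Cpp"
MAX_LENGTH = 600  # max sequence length listed in the training hyperparameters


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify(code, model_id=MODEL_ID):
    """Classify one C/C++ snippet; downloads the checkpoint on first use."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(code, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # The label order is read from the checkpoint config, not hard-coded here.
    return decode(logits, model.config.id2label)
```

Calling `classify("void f(char *s) { char b[8]; strcpy(b, s); }")` would return a `(label, probability)` pair. If the checkpoint exposes generic names such as `LABEL_0`, an explicit mapping to the table's labels would be needed.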
### Model Sources
- **Repository:** [The official lemon42-ai Github repository](https://github.com/lemon42-ai/ThreatDetect-code-vulnerability-detection)
- **Technical Blog Post:** Coming soon.
## Uses
ThreatDetect-C-Cpp can be integrated into code-related applications. For example, it can be paired with a code generator to detect vulnerabilities in the generated code.
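
As an illustration of the code-generator pairing, here is a small sketch of a filter that flags generated snippets not classified as `safe`. The function `flag_unsafe`, the `min_confidence` threshold, and `stub_classifier` are all hypothetical; the stub stands in for the real model so the example runs offline, and any callable returning `(label, confidence)` could replace it.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier maps source code to (label, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def flag_unsafe(snippets: Iterable[str], classify: Classifier,
                min_confidence: float = 0.5) -> List[Tuple[str, str, float]]:
    """Return (snippet, label, confidence) for every snippet not judged safe."""
    flagged = []
    for snippet in snippets:
        label, confidence = classify(snippet)
        # Keep only confident non-safe predictions; the threshold is illustrative.
        if label != "safe" and confidence >= min_confidence:
            flagged.append((snippet, label, confidence))
    return flagged


def stub_classifier(code: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the real model so the sketch runs offline."""
    return ("CWE-787", 0.9) if "strcpy" in code else ("safe", 0.8)


generated = [
    "void f(char *s) { char buf[8]; strcpy(buf, s); }",
    "int add(int a, int b) { return a + b; }",
]
for snippet, label, confidence in flag_unsafe(generated, stub_classifier):
    print(f"{label} ({confidence:.2f}): {snippet}")
```

Taking the classifier as a parameter keeps the filter independent of how the model is loaded, which makes it easy to test and to swap in a batched or remote classifier later.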
## Bias, Risks, and Limitations
ThreatDetect-C-Cpp can detect weaknesses in C/C++ code only. It should not be used with other programming languages.<br>
The model can only detect the six CWEs in the table above.
## Training Details
### Training Data
The model was fine-tuned on a minified, cleaned, and deduplicated version of the [DiverseVul](https://github.com/wagner-group/diversevul) dataset. <br>
This version can be explored on the Hugging Face Hub: [lemon42-ai/minified-diverseful-multilabels](https://huggingface.co/datasets/lemon42-ai/minified-diverseful-multilabels).
### Training Procedure
The model was trained with LoRA adapters applied to the attention projection `attn.Wqkv`, which ModernBERT implements as a single fused query/key/value layer.
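
A minimal sketch of how the LoRA values listed under Training Hyperparameters could be expressed with the `peft` library. This is not the authors' training script: the `SEQ_CLS` task type, the `NUM_LABELS` constant, and the helper names `lora_scaling` and `build_lora_model` are assumptions made for illustration.

```python
# LoRA settings copied from the hyperparameter table in this card.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "target_modules": ["attn.Wqkv"],
    "task_type": "SEQ_CLS",  # assumption: sequence-classification task type
}

NUM_LABELS = 7  # 'safe' plus the six CWE labels


def lora_scaling(kwargs=LORA_KWARGS):
    """LoRA scales its low-rank update by alpha / r."""
    return kwargs["lora_alpha"] / kwargs["r"]


def build_lora_model(base_id="answerdotai/ModernBERT-base"):
    """Wrap the base model with LoRA adapters (requires transformers and peft)."""
    # Imported lazily so the configuration above can be inspected without them.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForSequenceClassification

    base = AutoModelForSequenceClassification.from_pretrained(
        base_id, num_labels=NUM_LABELS
    )
    return get_peft_model(base, LoraConfig(**LORA_KWARGS))
```

With rank 8 and alpha 32, the adapter update is scaled by 4 before being added to the frozen projection weights.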
#### Training Hyperparameters
| Hyperparameter | Value |
|-------------------------|---------------------------|
| Max Sequence Length | 600 |
| Batch Size | 32 |
| Number of Epochs | 9 |
| Learning Rate | 5e-4 |
| Weight Decay | 0.01 |
| Logging Steps | 100 |
| LoRA Rank (r) | 8 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| LoRA Target Modules | attn.Wqkv |
| Optimizer | AdamW |
| LR Scheduler | CosineAnnealingWarmRestarts |
| Scheduler T_0 | 10 |
| Scheduler T_mult | 2 |
| Scheduler eta_min | 1e-6 |
| Training Split Ratio | 90% Train / 10% Validation |
| Seed for Splitting | 42 |
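
To make the scheduler rows concrete, the sketch below computes the learning rate that cosine annealing with warm restarts produces for the listed values (base rate 5e-4, `T_0` 10, `T_mult` 2, `eta_min` 1e-6). It is a plain-Python rendering of the standard formula, equivalent in intent to `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`; the function name `warm_restart_lr` is illustrative, and the card does not say whether the scheduler was stepped per epoch or per batch.

```python
import math


def warm_restart_lr(t, base_lr=5e-4, eta_min=1e-6, t_0=10, t_mult=2):
    """Learning rate after `t` scheduler steps of cosine annealing with warm restarts."""
    cycle_len, t_cur = t_0, t
    # Walk forward through completed cycles; each one is t_mult times longer.
    while t_cur >= cycle_len:
        t_cur -= cycle_len
        cycle_len *= t_mult
    cosine = 1 + math.cos(math.pi * t_cur / cycle_len)
    return eta_min + 0.5 * (base_lr - eta_min) * cosine


# Print the rate at a few points, including the restarts at steps 10 and 30.
for t in (0, 5, 9, 10, 30):
    print(t, f"{warm_restart_lr(t):.6f}")
```

If the scheduler were stepped once per epoch, all 9 epochs would fall inside the first 10-step cycle and no restart would occur; stepped per batch, the rate would jump back to the base value at steps 10, 30, 70, and so on.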
## Evaluation
ThreatDetect-C-Cpp reaches an accuracy of 86% on the evaluation set.
## Technical Specifications
### Hardware
The model was fine-tuned on 4 Tesla V100 GPUs for 1 hour using PyTorch and Accelerate.