
	---
	library_name: transformers
	tags:
	- code
	- cybersecurity
	- vulnerability
	- cpp
	license: apache-2.0
	datasets:
	- lemon42-ai/minified-diverseful-multilabels
	metrics:
	- accuracy
	base_model:
	- answerdotai/ModernBERT-base
	pipeline_tag: text-classification
	---

	# Model Card for ThreatDetect-C-Cpp

	<!-- ![deck](deck.png){: width="200px"} -->
	<img src="linkedin-deck.png" width="800">

	This model is derived from [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base). <br>
	We fine-tuned ModernBERT-base to detect vulnerabilities in C/C++ code. <br>
	The current version reaches an accuracy of 86% on the evaluation set. <br>

	## Model Details

	### Model Description

	ThreatDetect-C-Cpp can be used as a code classifier. <br>
	Instead of binary classification ("safe" vs. "unsafe"), the model classifies the input code into 7 labels: 'safe' (no vulnerability detected) and six CWE weaknesses:

	| Label   | Description |
	|---------|-------------------------------------------------------|
	| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer |
	| CWE-125 | Out-of-bounds Read |
	| CWE-20  | Improper Input Validation |
	| CWE-416 | Use After Free |
	| CWE-703 | Improper Check or Handling of Exceptional Conditions |
	| CWE-787 | Out-of-bounds Write |
	| safe    | Safe code |


	- Developed by: [lemon42-ai](https://github.com/lemon42-ai)
	- Contributors: [Abdellah Oumida](https://www.linkedin.com/in/abdellah-oumida-ab9082234/) & [Mohammed Sbaihi](https://www.linkedin.com/in/mohammed-sbaihi-aa6493254/)
	- Model type: [ModernBERT, Encoder-only Transformer](https://arxiv.org/abs/2412.13663)
	- Supported Programming Languages: C/C++
	- License: Apache 2.0 (see the original ModernBERT-base license)
	- Finetuned from model: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

	### Model Sources


	- Repository: [The official lemon42-ai GitHub repository](https://github.com/lemon42-ai/ThreatDetect-code-vulnerability-detection)
	- Technical Blog Post: Coming soon.

	## Uses

	ThreatDetect-C-Cpp can be integrated into code-related applications. For example, it can be paired with a code generator to detect vulnerabilities in the generated code.
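As a rough sketch of that pairing, the snippet below wraps the classifier in a small helper that flags any snippet whose top label is not `safe`. Several things here are assumptions rather than facts from this card: the Hub id `lemon42-ai/ThreatDetect-C-Cpp`, the helper names `load_classifier` and `review_snippet`, and the premise that the checkpoint loads directly with the `transformers` text-classification pipeline and reports the label names from the table above.

```python
from typing import Any, Callable, Dict

# Assumed Hub id (not stated in this card); adjust to the actual repository.
MODEL_ID = "lemon42-ai/ThreatDetect-C-Cpp"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("text-classification", model=model_id)


def review_snippet(classifier: Callable[..., Any], code: str,
                   safe_label: str = "safe") -> Dict[str, Any]:
    """Classify one snippet and flag it unless the top label is the safe one."""
    # truncation=True cuts inputs that exceed the model's maximum length.
    top = classifier(code, truncation=True)[0]
    return {
        "label": top["label"],
        "score": top["score"],
        # Assumes the checkpoint reports "safe" for non-vulnerable code.
        "flagged": top["label"] != safe_label,
    }


# Example (requires network access and the transformers package):
# clf = load_classifier()
# print(review_snippet(clf, "void f(char *s) { char b[8]; strcpy(b, s); }"))
```

In a generation loop, a flagged result could trigger a regeneration or a human review; how to act on the label is left to the application.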



	## Bias, Risks, and Limitations

	ThreatDetect-C-Cpp detects weaknesses in C/C++ code only and should not be used with other programming languages.<br>
	The model can only detect the six CWEs listed in the table above.



	## Training Details

	### Training Data

	The model was fine-tuned on a minified, cleaned, and deduplicated version of the [DiverseVul](https://github.com/wagner-group/diversevul) dataset. <br>
	This version can be explored on Hugging Face Datasets [here](https://huggingface.co/datasets/lemon42-ai/minified-diverseful-multilabels).
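For readers who want to inspect the data, a minimal sketch using the `datasets` library follows. The dataset id comes from this card; the `train` split name and the helper names are assumptions, and the card does not say that the authors produced their 90/10 split with this exact method.

```python
# Dataset id taken from this card; the "train" split name is an assumption.
DATASET_ID = "lemon42-ai/minified-diverseful-multilabels"


def load_raw(dataset_id: str = DATASET_ID, split: str = "train"):
    """Download one split of the dataset from the Hugging Face Hub."""
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(dataset_id, split=split)


def split_train_val(dataset, val_fraction: float = 0.1, seed: int = 42):
    """Split into train/validation parts (90/10 with seed 42 by default)."""
    parts = dataset.train_test_split(test_size=val_fraction, seed=seed)
    return parts["train"], parts["test"]


# Example (requires network access and the datasets package):
# train_ds, val_ds = split_train_val(load_raw())
# print(train_ds.column_names, len(train_ds), len(val_ds))
```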

	### Training Procedure

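	The model was fine-tuned with LoRA applied to the attention projections. In ModernBERT the query, key, and value projections are fused into a single `attn.Wqkv` module, which is the LoRA target listed in the hyperparameters below.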
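A LoRA setup along these lines can be sketched with the `peft` library. This is an illustration built from the hyperparameters listed in this card (rank 8, alpha 32, dropout 0.1, target module `attn.Wqkv`), not the authors' training script; the function names and the choice of `TaskType.SEQ_CLS` are assumptions.

```python
def lora_scaling(r: int = 8, alpha: int = 32) -> float:
    """Factor applied to the low-rank update in standard LoRA (alpha / r)."""
    return alpha / r


def build_lora_model(num_labels: int = 7):
    """Wrap ModernBERT-base with LoRA adapters for sequence classification."""
    # Heavy dependencies are imported lazily so the helper above stays usable.
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForSequenceClassification

    base = AutoModelForSequenceClassification.from_pretrained(
        "answerdotai/ModernBERT-base", num_labels=num_labels
    )
    config = LoraConfig(
        task_type=TaskType.SEQ_CLS,      # assumption: classification head setup
        r=8,                             # LoRA rank from the card
        lora_alpha=32,                   # LoRA alpha from the card
        lora_dropout=0.1,                # LoRA dropout from the card
        target_modules=["attn.Wqkv"],    # fused Q/K/V projection in ModernBERT
    )
    return get_peft_model(base, config)


# Example (downloads the base model):
# model = build_lora_model()
# model.print_trainable_parameters()
```

With rank 8 and alpha 32, the low-rank update is scaled by `lora_scaling()`, i.e. alpha divided by rank, under the standard LoRA formulation.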



	#### Training Hyperparameters

	| Hyperparameter          | Value                       |
	|-------------------------|-----------------------------|
	| Max Sequence Length     | 600                         |
	| Batch Size              | 32                          |
	| Number of Epochs        | 9                           |
	| Learning Rate           | 5e-4                        |
	| Weight Decay            | 0.01                        |
	| Logging Steps           | 100                         |
	| LoRA Rank (r)           | 8                           |
	| LoRA Alpha              | 32                          |
	| LoRA Dropout            | 0.1                         |
	| LoRA Target Modules     | attn.Wqkv                   |
	| Optimizer               | AdamW                       |
	| LR Scheduler            | CosineAnnealingWarmRestarts |
	| Scheduler T_0           | 10                          |
	| Scheduler T_mult        | 2                           |
	| Scheduler eta_min       | 1e-6                        |
	| Training Split Ratio    | 90% Train / 10% Validation  |
	| Seed for Splitting      | 42                          |
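To make the scheduler rows concrete, here is a minimal pure-Python sketch of the learning rate that cosine annealing with warm restarts yields for the values above. The function name `lr_at` is hypothetical, and the card does not state whether the scheduler was stepped per epoch or per batch, so `step` simply counts scheduler steps.

```python
import math


def lr_at(step: int, base_lr: float = 5e-4, eta_min: float = 1e-6,
          t_0: int = 10, t_mult: int = 2) -> float:
    """Learning rate after `step` scheduler steps (cosine annealing, warm restarts)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    cycle_len, pos = t_0, step
    # Skip over completed cycles; each cycle is t_mult times longer than the last.
    while pos >= cycle_len:
        pos -= cycle_len
        cycle_len *= t_mult
    # Cosine decay from base_lr towards eta_min within the current cycle.
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * pos / cycle_len))


# A restart is any step where the learning rate jumps back up.
restart_steps = [s for s in range(1, 80) if lr_at(s) > lr_at(s - 1)]
print(restart_steps)
```

With PyTorch, the equivalent objects would be `torch.optim.AdamW(params, lr=5e-4, weight_decay=0.01)` combined with `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-6)`.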



	## Evaluation

	ThreatDetect-C-Cpp reaches an accuracy of 86% on the evaluation set.
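For anyone reproducing the evaluation, accuracy is the fraction of snippets whose predicted label matches the reference label. The sketch below computes it, along with a per-label recall breakdown that this card does not report. The helper names are hypothetical, and the toy labels are made up purely for illustration; they are not model outputs.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """For each reference label, the share of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / count for label, count in totals.items()}


# Toy labels for illustration only (not real evaluation data).
toy_true = ["safe", "safe", "CWE-787", "CWE-416"]
toy_pred = ["safe", "CWE-787", "CWE-787", "CWE-416"]
print(accuracy(toy_true, toy_pred), per_label_recall(toy_true, toy_pred))
```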



	## Technical Specifications


	#### Hardware

	The model was fine-tuned on 4 Tesla V100 GPUs for 1 hour using PyTorch and Hugging Face Accelerate.