---
language:
- "en"
thumbnail: "thumbnail_url_here" # Replace with your thumbnail URL
tags:
- "fine-tuning"
- "LoRA"
- "vulnerability-detection"
- "code-analysis"
license: "apache-2.0" # Replace with your chosen license
datasets:
- "your_username/cleaned-reposvul" # Update with your dataset URL
base_model: "Qwen/Qwen2.5-Coder-3B"
---
# Model Card for security-qwen2.5-3b-coder-instruct

## Model Description
This model, security-qwen2.5-3b-coder-instruct, is a fine-tuned version of Qwen2.5-Coder-3B adapted specifically for vulnerability detection in software code. It was trained on a cleaned version of the ReposVul dataset, which covers vulnerabilities in C, C++, Java, and Python. The fine-tuning was performed with LoRA (Low-Rank Adaptation) to adapt the base model to this task efficiently.
## Intended Uses & Limitations
**Intended uses:** This model is designed to assist in identifying potential vulnerabilities in code written in C, C++, Java, and Python. It can be used as part of a security review process to help developers and security professionals find security issues in their codebases.

**Limitations:** The model's performance may degrade when multiple vulnerabilities are present in the same code snippet, and it may not identify all of them. It is also trained on the specific vulnerability types present in the ReposVul dataset and may not generalize to other vulnerability classes or to programming languages outside the training data.
## How to Use
To use this model for vulnerability detection, you can leverage the Hugging Face Transformers library along with PEFT (Parameter-Efficient Fine-Tuning). Here's an example of how to load and use the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "Qwen/Qwen2.5-Coder-3B"
adapter_name = "your_username/security-qwen2.5-3b-coder-instruct"

tokenizer = AutoTokenizer.from_pretrained(adapter_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the LoRA adapter. If the adapter weights were merged into the
# repository, load it directly with AutoModelForCausalLM instead.
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example usage
code_snippet = """
your code here
"""
inputs = tokenizer(
    code_snippet, return_tensors="pt", truncation=True, max_length=512
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Please note that actual usage might depend on the specific task and how the model was fine-tuned. The above is a general example.
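Because this is an instruct-style fine-tune, prompting through the tokenizer's chat template may match the training format better than passing raw code. Continuing the example above, here is a minimal sketch; the system prompt and expected answer format are assumptions and should be matched to whatever format was actually used during fine-tuning:

```python
# Sketch: instruction-style prompting via the chat template.
# The system prompt below is an assumption, not the one used in training.
messages = [
    {"role": "system", "content": "You are a security analyst. Identify any vulnerabilities in the given code."},
    {"role": "user", "content": code_snippet},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```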
## Training Data
The model was fine-tuned on a cleaned version of the ReposVul dataset. ReposVul is a high-quality vulnerability dataset that includes 6,134 CVE entries across 1,491 projects in C, C++, Java, and Python, providing multi-granularity information from repository-level to line-level. The dataset was cleaned to improve data quality, with specific cleaning steps detailed in the dataset's README.
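For illustration, one way to turn such records into training text with the `datasets` library is sketched below. The column names (`language`, `code`, `cwe_id`) are assumptions; check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Hypothetical column names -- adjust to the actual dataset schema.
ds = load_dataset("your_username/cleaned-reposvul", split="train")

def to_text(example):
    # Collapse one vulnerability record into a single training string.
    prompt = (
        f"Analyze the following {example['language']} code for vulnerabilities:\n\n"
        f"{example['code']}\n\n"
    )
    answer = f"Vulnerability: {example['cwe_id']}"
    return {"text": prompt + answer}

train_ds = ds.map(to_text)
```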
## Training Procedure

- **Base model:** Qwen2.5-Coder-3B
- **Fine-tuning method:** LoRA (Low-Rank Adaptation); an illustrative configuration is sketched after this list
- **Training data:** Cleaned ReposVul dataset
- **Hardware:** [Specify the hardware used, e.g., A100 GPUs]
- **Hyperparameters:** [List the hyperparameters used, e.g., learning rate, batch size, number of epochs]
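The exact training configuration is not recorded here, but a LoRA fine-tune of this kind can be approximated with PEFT and TRL's `SFTTrainer`. The following is an illustrative sketch, not the actual recipe: the choice of TRL, the target modules, and every hyperparameter value are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B", torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative LoRA setup; rank, alpha, dropout, and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=base,
    peft_config=lora_config,
    train_dataset=train_ds,  # prepared as in the Training Data sketch above
    args=SFTConfig(
        output_dir="security-qwen2.5-3b-coder-instruct",
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        num_train_epochs=3,
    ),
)
trainer.train()
trainer.save_model()  # saves the LoRA adapter weights
```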
## References

- **Dataset:** ReposVul
- **Paper:** A Repository-Level Dataset For Detecting, Classifying and Repairing Software Vulnerabilities
- **Base model:** Qwen2.5-Coder-3B