# Model Card for security-qwen2.5-3b-coder-instruct

## Model Description

This model, `security-qwen2.5-3b-coder-instruct`, is a fine-tuned version of Qwen2.5-Coder-3B adapted specifically for vulnerability detection in software code. It was trained on a cleaned version of the ReposVul dataset, which covers vulnerabilities in the C, C++, Java, and Python programming languages. Fine-tuning was performed with LoRA (Low-Rank Adaptation) to adapt the base model to this task efficiently.

## Intended Uses & Limitations
**Intended Uses:** This model is designed to help identify potential vulnerabilities in code written in C, C++, Java, and Python. It can be used as part of a security review process to help developers and security professionals find security issues in their codebases.

**Limitations:** While the model performs well at detecting vulnerabilities, its performance may degrade when multiple vulnerabilities are present in the same code snippet, and it may not identify all of them correctly in such cases. Additionally, the model is trained on the specific vulnerability types present in the ReposVul dataset and may not generalize to other vulnerability types or to programming languages not covered in the training data.
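To make the multi-vulnerability limitation concrete, here is a hypothetical Python snippet containing two distinct issues; on input like this, the model may flag only one of them:

```python
import subprocess

DB_PASSWORD = "hunter2"  # issue 1: hardcoded credential (CWE-798)

def ping(host: str) -> str:
    # issue 2: OS command injection (CWE-78) -- `host` is interpolated
    # into a shell command without any sanitization
    return subprocess.check_output(f"ping -c 1 {host}", shell=True, text=True)
```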
## How to Use

To use this model for vulnerability detection, you can leverage the Hugging Face Transformers library along with PEFT (Parameter-Efficient Fine-Tuning). Here's an example of how to load the LoRA adapter on top of the base model and run it:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "Qwen/Qwen2.5-Coder-3B"
adapter_name = "your_username/security-qwen2.5-3b-coder-instruct"

tokenizer = AutoTokenizer.from_pretrained(adapter_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.float16, device_map="auto"
)
# Attach the fine-tuned LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, adapter_name)

# Example usage
code_snippet = """
your code here
"""
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, max_length=512).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
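Given the `instruct` suffix in the model name, the fine-tune was likely trained in a chat format, so wrapping the snippet in a chat-style prompt may produce better results. The instruction wording below is an assumption, not the exact template used during fine-tuning; it reuses the `model`, `tokenizer`, and `code_snippet` from the block above:

```python
# Hypothetical prompt wording -- adjust to match the template
# actually used during fine-tuning.
messages = [
    {
        "role": "user",
        "content": f"Analyze the following code for security vulnerabilities:\n\n{code_snippet}",
    },
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```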
Please note that actual usage may depend on the specific task and the prompt format used during fine-tuning; the above are general examples.

## Training Data

The model was fine-tuned on a cleaned version of the ReposVul dataset. ReposVul is a high-quality vulnerability dataset that includes 6,134 CVE entries across 1,491 projects in C, C++, Java, and Python, providing multi-granularity information from the repository level down to the line level. The dataset was cleaned to improve data quality; the specific cleaning steps are detailed in the dataset's README.

## Training Procedure
- **Base Model:** Qwen2.5-Coder-3B
- **Fine-Tuning Method:** LoRA (Low-Rank Adaptation); a configuration sketch is shown below
- **Training Data:** Cleaned ReposVul dataset
- **Hardware:** [Specify the hardware used, e.g., A100 GPUs]
- **Hyperparameters:** [List the hyperparameters used, e.g., learning rate, batch size, number of epochs, etc.]
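Since the exact hyperparameters are not listed above, the following is only a minimal sketch of how a LoRA fine-tune like this one is typically configured with the `peft` library; the rank, alpha, dropout, and target modules are assumed values, not the ones used to train this model:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-3B")

# Assumed LoRA settings -- common defaults, not the values
# actually used for this model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```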
## References

- Dataset: ReposVul
- Paper: A Repository-Level Dataset For Detecting, Classifying and Repairing Software Vulnerabilities
- Base Model: Qwen2.5-Coder-3B