arxiv:2504.05866

CTI-HAL: A Human-Annotated Dataset for Cyber Threat Intelligence Analysis

Published on Apr 8

Abstract

Organizations are increasingly targeted by Advanced Persistent Threats (APTs), which involve complex, multi-stage tactics and diverse techniques. Cyber Threat Intelligence (CTI) sources, such as incident reports and security blogs, provide valuable insights, but are often unstructured and written in natural language, which makes automatic information extraction difficult. Recent studies have explored the use of AI to perform automatic extraction from CTI data, leveraging existing CTI datasets for performance evaluation and fine-tuning. However, these datasets present challenges and limitations that impact their effectiveness. To overcome these issues, we introduce a novel dataset manually constructed from CTI reports and structured according to the MITRE ATT&CK framework. To assess its quality, we conducted an inter-annotator agreement study using Krippendorff's alpha, confirming its reliability. Furthermore, the dataset was used to evaluate a Large Language Model (LLM) in a real-world business context, showing promising generalizability.
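For readers unfamiliar with the agreement measure named in the abstract, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is not the authors' evaluation code: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per annotated item), and the example labels are assumptions made for illustration. The ATT&CK technique IDs are used only as sample label values and are not taken from the dataset.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists (hypothetical layout): each inner list holds
    the labels that the annotators assigned to one item, with None marking
    a missing annotation.
    """
    # Coincidence matrix: every ordered pair of labels given to the same item
    # by two different annotators contributes 1 / (m - 1), where m is the
    # number of labels the item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item labelled by fewer than two annotators is unpairable
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c per label and the grand total n.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both left unnormalised by n,
    # since the factor cancels in the ratio).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("only one label value was used; alpha is undefined")
    return 1.0 - observed / expected


# Illustrative input: two annotators, four report sentences.
example = [
    ["T1059", "T1059"],
    ["T1059", "T1566"],
    ["T1566", "T1566"],
    ["T1566", "T1566"],
]
alpha = krippendorff_alpha_nominal(example)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance. The nominal variant treats any two different labels as equally distant, which is a simplifying assumption here; the paper may use a different distance metric.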
