
Improving the Classification of Cybersecurity Attack Procedures using Retrieval Augmented Generation

Research output: Contribution to journal › Article › peer-review

Abstract

Understanding the tactics (why), techniques (how) and procedures (methods) behind a cybersecurity attack is paramount to developing defenses against it or mitigating its effects. However, this task requires a high level of technical expertise, and is time-consuming and error-prone. In this work we verify that open-source Llama 3.1 LLMs (Large Language Models) cannot, on their own, automatically identify which of the 625 MITRE techniques is used within a cybersecurity attack procedure. We evaluate two RAG (Retrieval Augmented Generation) approaches to enhance the classification accuracy. Our experiments show the importance of the embedding model in information retrieval. Moreover, our analysis shows that selecting appropriate examples helps the language model reduce ambiguity. Specifically, a dynamic few-shot learning strategy performs best for larger models, whereas a multiple-choice strategy is more appropriate for smaller models. In contrast, corrective RAG techniques fail to provide significant enhancements, highlighting current methodological limitations and the inherent complexity of this task.
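The dynamic few-shot idea mentioned in the abstract can be illustrated with a minimal sketch: retrieve the labeled procedures most similar to the query and place them in the prompt as examples. This is not the authors' implementation. The example texts, the function names (`embed`, `retrieve`, `build_prompt`) and the prompt wording are hypothetical, and the bag-of-words similarity is only a stand-in for the text embedding model the abstract refers to.

```python
import math
import re
from collections import Counter

# Hypothetical labeled procedures (illustrative text, not the paper's dataset).
# The technique IDs are real MITRE ATT&CK identifiers.
EXAMPLES = [
    ("The group sent spearphishing emails with malicious attachments.", "T1566"),
    ("The malware runs PowerShell commands to download a payload.", "T1059"),
    ("The tool dumps credentials from LSASS memory.", "T1003"),
]


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Return the k labeled examples most similar to the query."""
    q = embed(query)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, k=2):
    """Assemble a few-shot prompt whose examples depend on the query."""
    lines = ["Classify the procedure into a MITRE technique ID.", ""]
    for text, label in retrieve(query, k):
        lines += [f"Procedure: {text}", f"Technique: {label}", ""]
    lines += [f"Procedure: {query}", "Technique:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Attackers used PowerShell to run an encoded command."))
```

Under the same assumptions, a multiple-choice variant would list the retrieved technique IDs as candidate answers instead of showing them as solved examples.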

Translated title of the contribution: Clasificación de Procedimientos de Ataques de Ciberseguridad mediante Generación Aumentada por Recuperación
Original language: English
Pages (from-to): 199-210
Number of pages: 12
Journal: Procesamiento del Lenguaje Natural
Volume: 75
Publication status: Published - Sept 2025

Keywords

  • Cyber-security
  • open-source LLM
  • RAG
  • text embedding
