HNUSystemsLab/LKRepair

LKRepair: An Automated LLM Patch Generation Method for Linux Kernel Vulnerability Fixes

LKRepair is an LLM-based framework for automatically generating defect-fixing patches for the Linux kernel. The framework includes the LKRD dataset, the long-prompt optimization method CoLPO, and subsystems for source-code-level application of Linux kernel patches, multi-node automatic compilation, and multi-node automated verification of fixes. We host the source code and dataset on two open platforms: GitHub and Zenodo.

LKRD is the framework's domain-specific dataset. It provides real-world bug-fixing context to the framework and also serves as a training dataset for fine-tuning LLMs on Linux kernel bug fixing. The dataset covers defects and fix patches released by the kernel community between 2017 and 2025, comprising 333 kernel sub-versions and 324 major commits, and spanning 97 core submodules and 4 cross-subsystem modules. After filtering the originally collected data, we obtained 9,286 complete samples, including 3,112 unique bugs, 2,669 patches, 6,560 source code blocks before and after fixes, and 7,275 patch code blocks. The dataset is stored in a MongoDB database and is approximately 5.32 GB in size (1.25 GB compressed).

Every sample, whether in LKRD or in a generated subset such as LKRD-C, includes 15 feature fields: Commit ID, CrashLog, Kconfig, POC-c, POC-syz, Commit ID-fix, CommitID-parent, Patch Code, Email List, Sub-system, Kernel Version, Source Code Chunk-bug, Source Code Chunk-fix, Patch Code Chunk, and Bug Source File Name. Samples also carry some additional fields used for development tracking.
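
As a rough illustration of the 15-field sample layout, the following Python sketch checks whether a record carries every feature field. The field labels are copied from the list above; whether the keys stored in MongoDB use exactly these names is an assumption, and missing_features is a hypothetical helper, not part of LKRepair.

```python
from typing import Dict, List

# Field labels copied from the README; the real MongoDB key names
# are an assumption and may differ.
LKRD_FEATURES = (
    "Commit ID", "CrashLog", "Kconfig", "POC-c", "POC-syz",
    "Commit ID-fix", "CommitID-parent", "Patch Code", "Email List",
    "Sub-system", "Kernel Version", "Source Code Chunk-bug",
    "Source Code Chunk-fix", "Patch Code Chunk", "Bug Source File Name",
)


def missing_features(sample: Dict[str, object]) -> List[str]:
    """Return the feature fields that are absent or empty in a sample."""
    return [
        name for name in LKRD_FEATURES
        if sample.get(name) in (None, "", [])
    ]
```

For example, a record whose POC-c field is empty would be reported by missing_features as ["POC-c"], which makes it easy to spot incomplete samples before using them as LLM context.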

Dataset

Release address: https://doi.org/10.5281/zenodo.13338271

  • After extracting B-P-Pair.zip, rename the extracted folder to patch_pair and copy it into the pp_pair/ directory of the source code.
  • The data in lkrd.zip and lkrd-C.zip is in MongoDB format and must be imported into a database named Scrapy_DB2025. lkrd.zip holds the raw LKRD dataset of Linux bug fixes; after filtering, LKRD contains 9,286 complete samples, including 3,112 unique bugs and 2,669 patches. lkrd-C.zip holds LKRD-C, the result of initial processing on LKRD (B-P construction, code block segmentation, and semantic annotation), which yields 8,827 records.
  • Both datasets must be imported into the database before running LKRepair; otherwise the program will not run properly.
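
The first step in the list above (extract, rename, copy) can be sketched in Python as follows. This is a minimal sketch that assumes the archive unpacks to exactly one top-level folder; install_pair_data is a hypothetical helper name, and the real archive layout may differ.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_pair_data(archive: Path, project_root: Path) -> Path:
    """Extract the archive and place its top-level folder at pp_pair/patch_pair."""
    pp_dir = project_root / "pp_pair"
    pp_dir.mkdir(parents=True, exist_ok=True)
    target = pp_dir / "patch_pair"
    if target.exists():
        raise FileExistsError(f"{target} already exists")

    # Extract into a scratch directory first so a failed run leaves no debris.
    staging = Path(tempfile.mkdtemp(dir=pp_dir))
    try:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(staging)
        # Assumption: the archive contains a single top-level folder.
        folders = [p for p in staging.iterdir() if p.is_dir()]
        if len(folders) != 1:
            raise ValueError("expected exactly one top-level folder in the archive")
        shutil.move(str(folders[0]), str(target))
    finally:
        shutil.rmtree(staging, ignore_errors=True)
    return target
```

A call such as install_pair_data(Path("B-P-Pair.zip"), Path("LKRepair2025")) would then leave the data under LKRepair2025/pp_pair/patch_pair.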

Source Code

Release address: https://github.com/HNUSystemsLab/LKRepair

  • We recommend naming the project LKRepair2025 and setting up the development environment with Anaconda. First, read readme.md in the root directory carefully. Next, import lkr1.yaml from the root directory to initialize the library environment. Finally, install the database and import the LKRD and LKRD-C data.
  • The pp_pair directory in the root directory contains the unzipped B-P-Pair data. The LLM folder contains prompt templates and the source code for the local and remote LLM API interfaces. The plugins directory holds plugin interfaces, such as the CoLPO plugin for LKRepair; developers interested in third-party plugins can start exploring from this directory. The config.py file in the public directory configures the project's basic information, directory structure, LLM API keys, and other settings.
  • The entire project is started through a simple command-line system, and each file includes comments.
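
Once the setup steps above are done, a quick check that both datasets were imported might look like the sketch below. The database name Scrapy_DB2025 comes from this README; the collection names lkrd and lkrd-C are guesses based on the archive names, the connection URI is an assumed local default, and both function names are hypothetical. The sketch requires the pymongo package.

```python
from typing import Iterable, List

DB_NAME = "Scrapy_DB2025"  # database name given in the README
# Hypothetical collection names, guessed from the archive names.
REQUIRED_COLLECTIONS = ("lkrd", "lkrd-C")


def missing_collections(
    existing: Iterable[str],
    required: Iterable[str] = REQUIRED_COLLECTIONS,
) -> List[str]:
    """Return the required collections that are not present."""
    present = set(existing)
    return [name for name in required if name not in present]


def check_database(uri: str = "mongodb://localhost:27017") -> List[str]:
    """Connect to MongoDB and report which required collections are missing."""
    from pymongo import MongoClient  # imported lazily; needs pymongo installed

    client = MongoClient(uri, serverSelectionTimeoutMS=3000)
    try:
        return missing_collections(client[DB_NAME].list_collection_names())
    finally:
        client.close()
```

An empty list from check_database() would suggest that both datasets are in place; any names it returns still need to be imported.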

Note

The source code will be uploaded after the article is accepted.

Introduction

Manual patch writing is still the main means of repairing kernel vulnerabilities. Although LLMs excel at code generation and error correction, research on applying them to kernel defect repair has been slow because dedicated datasets and end-to-end frameworks are lacking. Existing frameworks do not adapt the LLM to the domain, and they struggle to generate high-quality patches because of input-length constraints and the "central forgetting" problem. This paper therefore presents a systematic study of rapid patch generation with LLMs. First, we build LKRD, a large-scale dataset that stays close to the production environment, from a large number of real crash reports and official patches. Second, we propose and implement LKRepair, an end-to-end intelligent repair framework that covers the full process: defect localization, patch generation, automatic compilation, and automatic verification. Finally, to address the limited context window that LLMs have for long prompts, we design CoLPO, a long-prompt optimization method dedicated to kernel code repair, which effectively improves the model's patch generation ability when the input contains a very large number of tokens. Experiments show that LKRepair can efficiently generate patches when given only crash logs and bug code blocks, while CoLPO, which supports syntax-level cutting, effectively breaks through the LLM prompt-length bottleneck so that the model can handle context of nearly unlimited length. Overall, this work provides a new technical route for kernel defect repair and lays a foundation for the intelligent repair of operating system defects.

About

An Intelligent, LLM-Guided Repair Framework for Generating Linux Kernel Patches
