LKRepair is an LLM-based framework for automatically generating defect-fixing patches for the Linux kernel. The framework includes the LKRD dataset, the long-prompt optimization method CoLPO, and subsystems for source-level application of Linux kernel patches, multi-node automatic compilation, and multi-node automated verification of fixes. We host the source code and dataset on two open platforms: GitHub and Zenodo.

LKRD is the framework's domain-specific dataset. It provides real-world bug-fixing context for the framework and also serves as a training dataset for fine-tuning LLMs on Linux kernel bug fixing. The dataset covers defects and fix patches released by the kernel community between 2017 and 2025, comprising 333 kernel sub-versions and 324 major commits and spanning 97 core submodules and 4 cross-subsystem modules. After filtering the originally collected data, we obtained 9,286 complete samples, including 3,112 unique bugs, 2,669 patches, 6,560 source code blocks before and after fixes, and 7,275 patch code blocks. The dataset is stored in a MongoDB database and has a size of approximately 5.32 GB (1.25 GB when compressed).

Each sample, whether in LKRD or in a generated subset such as LKRD-C, includes 15 feature fields: Commit ID, CrashLog, Kconfig, POC-c, POC-syz, Commit ID-fix, CommitID-parent, Patch Code, Email List, Sub-system, Kernel Version, Source Code Chunk-bug, Source Code Chunk-fix, Patch Code Chunk, and Bug Source File Name, plus some fields used for development tracking.
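As a rough illustration of what a "complete sample" means, the sketch below checks one record against the 15 feature fields listed above. It assumes the MongoDB documents use these names verbatim as keys, which this README does not confirm; `missing_fields` and `is_complete` are hypothetical helpers, not part of LKRepair.

```python
from __future__ import annotations

# Field names copied from the list above. Whether the stored documents use
# exactly these spellings as keys is an assumption.
LKRD_FIELDS = (
    "Commit ID", "CrashLog", "Kconfig", "POC-c", "POC-syz",
    "Commit ID-fix", "CommitID-parent", "Patch Code", "Email List",
    "Sub-system", "Kernel Version", "Source Code Chunk-bug",
    "Source Code Chunk-fix", "Patch Code Chunk", "Bug Source File Name",
)


def missing_fields(sample: dict) -> list[str]:
    """Return the feature fields that are absent or empty in one sample."""
    return [name for name in LKRD_FIELDS if sample.get(name) in (None, "", [])]


def is_complete(sample: dict) -> bool:
    """A sample counts as complete when all 15 feature fields are filled."""
    return not missing_fields(sample)
```

Extra keys, such as the development-tracking fields, are ignored by this check.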
Release address: https://doi.org/10.5281/zenodo.13338271
- After extracting B-P-Pair.zip, rename the folder to `patch_pair` and copy it to the `pp_pair/` directory in the source code.
- The data in lkrd.zip and lkrd-C.zip is in MongoDB format and must be imported into a database named `Scrapy_DB2025`. The former is the raw dataset for Linux bug fixes; after filtering, LKRD contains a total of 9,286 complete samples, including 3,112 unique bugs and 2,669 patches. The latter is the result of initial processing on LKRD, including B-P construction, code block segmentation, and semantic annotation, yielding 8,827 records.
- Both datasets must be imported into the database before LKRepair is run; the program will not run properly without them.
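The extract, rename, and copy step for B-P-Pair.zip can be scripted. The following is a minimal sketch using only the Python standard library; `install_bp_pairs` is a hypothetical helper, and the assumption that the archive unpacks to a single top-level folder has not been checked against the actual release.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_bp_pairs(archive: Path, source_root: Path) -> Path:
    """Extract the archive and place it at <source_root>/pp_pair/patch_pair."""
    target = source_root / "pp_pair" / "patch_pair"
    if target.exists():
        raise FileExistsError(f"{target} already exists; remove it first")
    target.parent.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(tmp)
        entries = list(Path(tmp).iterdir())
        # Assumption: the archive unpacks to one top-level folder, which is
        # renamed to patch_pair. Otherwise the extracted entries are moved
        # into a newly created patch_pair folder.
        if len(entries) == 1 and entries[0].is_dir():
            shutil.move(str(entries[0]), str(target))
        else:
            target.mkdir()
            for entry in entries:
                shutil.move(str(entry), str(target / entry.name))
    return target
```

For example, `install_bp_pairs(Path("B-P-Pair.zip"), Path("LKRepair2025"))` would leave the data in `LKRepair2025/pp_pair/patch_pair`, where `LKRepair2025` stands for the local checkout of the source code.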
Release address: https://github.com/HNUSystemsLab/LKRepair
- We recommend naming the project `LKRepair2025` and setting up the development environment using Anaconda. First, carefully read the contents of `readme.md` in the root directory. Next, import `lkr1.yaml` from the root directory to initialize the library environment. Finally, install the database and import the data from `LKRD` and `LKRD-C`.
- The `pp_pair` directory in the root directory contains the unzipped B-P-Pair data. The `LLM` folder contains prompt templates and source code for local and remote API interfaces related to the LLM. The `plugins` directory is for plugin interfaces; for example, the CoLPO plugin for LKRepair. Developers can start exploring third-party plugins from this directory. The `config.py` file in the `public` directory configures the project's basic information, directory structure, LLM API keys, and other data.
- The entire project uses a simple command-line system for startup, and each file includes comments.
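Before starting LKRepair, it can help to confirm that both datasets actually made it into the database. The sketch below is one way to do that with the third-party `pymongo` package. The database name `Scrapy_DB2025` comes from this README; the collection names `lkrd` and `lkrd-C` and the local connection URI are assumptions and should be replaced with whatever the import created.

```python
from typing import Iterable, List

DB_NAME = "Scrapy_DB2025"  # database name given in this README
# Hypothetical collection names; check what the import actually created.
REQUIRED_COLLECTIONS = ("lkrd", "lkrd-C")


def missing_collections(
    existing: Iterable[str],
    required: Iterable[str] = REQUIRED_COLLECTIONS,
) -> List[str]:
    """Return the required collection names that are not present."""
    present = set(existing)
    return [name for name in required if name not in present]


def check_local_database(uri: str = "mongodb://localhost:27017") -> List[str]:
    """Connect to MongoDB and report which required collections are missing."""
    # Imported lazily so the pure helper above works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri, serverSelectionTimeoutMS=3000)
    try:
        return missing_collections(client[DB_NAME].list_collection_names())
    finally:
        client.close()
```

An empty list from `check_local_database()` means both collections were found under the assumed names.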
The source code will be uploaded after the article is accepted.
Manual patch writing remains the main way to repair kernel vulnerabilities. Although LLMs excel at code generation and error correction, research on applying them to kernel defect repair has progressed slowly because dedicated datasets and end-to-end frameworks are lacking. Existing frameworks do not adapt the LLM to this domain, and long-input constraints and the "central forgetting" problem make it difficult to generate high-quality patches. This paper therefore presents a systematic study of rapid patch generation with LLMs. First, we build LKRD, a large-scale dataset that closely reflects the production environment, from a large number of real crash reports and official patches. Second, we propose and implement LKRepair, an end-to-end intelligent repair framework that covers defect localization, patch generation, automatic compilation, and automatic verification. Finally, to address the limited context window that LLMs have for long prompts, we design CoLPO, a long-prompt optimization method dedicated to kernel code repair, which effectively improves the model's patch generation ability with very long token inputs. Experiments show that LKRepair can efficiently generate patches when given only crash logs and buggy code blocks, and that CoLPO, which supports syntax-level trimming, overcomes the prompt length limit so that the model can handle contexts of nearly unlimited length. Overall, this work provides a new technical route for kernel defect repair and lays a foundation for the intelligent repair of operating system defects.