[Backport] KVM support THP and huge page splitting for dirty logging #267

Open
yechao-w wants to merge 18 commits into RVCK-Project:rvck-6.6 from yechao-w:thp-split

@yechao-w
Contributor

issues: #257

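1. Feature overview
All patches come from L0. With them, KVM supports the following two features:
1) G-stage THP support, which improves guest memory access performance;
2) G-stage huge page splitting, which enables VM migration when the guest is backed by huge pages (HugeTLB, THP).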

2. Patch set

| Patch series | No. | Commit | Subject | Patch consistency |
| --- | --- | --- | --- | --- |
| MMU related improvements for KVM RISC-V | 1/12 | 4a50578 | RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value | Identical |
| | 2/12 | 7c67de2 | RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init() | Identical |
| | 3/12 | b79bf20 | RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize() | Identical |
| | 4/12 | 7584eb6 | RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH | Identical |
| | 5/12 | eaa98ba | RISC-V: KVM: Don't flush TLB when PTE is unchanged | Identical |
| | 6/12 | ca539ba | RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range() | Identical |
| | 7/12 | 77ba646 | RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect() | Identical |
| | 8/12 | 4ecbd3e | RISC-V: KVM: Factor-out MMU related declarations into separate headers | Identical |
| | 9/12 | f035b44 | RISC-V: KVM: Introduce struct kvm_gstage_mapping | Baseline difference: required adaptation for the downstream-only function kvm_set_spte_gfn() |
| | 10/12 | 4c933f3 | RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence | Identical |
| | 11/12 | dd82e35 | RISC-V: KVM: Factor-out g-stage page table management | Baseline difference: required adaptation for the downstream-only function kvm_set_spte_gfn() |
| | 12/12 | 1f6d0ee | RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs | Identical |
| | fix | dec9ed9 | RISC-V: KVM: Fix use-after-free in kvm_riscv_gstage_get_leaf() | Identical |
| riscv: Support huge pfnmaps | 1 | 03dc00a | riscv: Support huge pfnmaps | Identical |
| RISC-V: KVM: Transparent huge page support | 1 | ed7ae7a | RISC-V: KVM: Transparent huge page support | Identical |
| | 1 | b342166 | RISC-V: KVM: Skip THP support check during dirty logging | Identical |
| | 2 | a216e24 | RISC-V: KVM: Fix lost write protection on huge pages during dirty logging | Identical |
| RISC-V: KVM: Split huge pages | 3 | 6ad36f3 | RISC-V: KVM: Split huge pages during fault handling for dirty logging | Identical |

3. Test verification
1) kvm-unit-tests verification

[root@tcg kvm-unit-tests]# ./run_tests.sh -g selftest
PASS selftest (7 tests, 1 skipped)

2) Huge-page VM migration verification
With QEMU TCG acting as the RISC-V host, KVM guests boot and migrate normally. Migration command:

[root@tcg migration]# virsh migrate riscv-vm qemu+ssh://192.168.124.128/system  tcp://192.168.124.128 --live --unsafe
| Memory type | Boot | Memory migration |
| --- | --- | --- |
| THP | OK | OK |
| hugetlb 2M | OK | OK |
| hugetlb 1G | OK | OK |

QEMU version used for the KVM guests:
Upstream QEMU 10.0 with the following patches applied:
https://lore.kernel.org/qemu-devel/20260305171958533kl0ISsA1hGkh97tu5LUWy@zte.com.cn/
https://lore.kernel.org/qemu-devel/20250915070811.3422578-2-xb@ultrarisc.com/

avpatel and others added 14 commits May 6, 2026 16:47
mainline inclusion
from mainline-6.17-rc1
commit 4a50578
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The kvm_riscv_vcpu_alloc_vector_context() does return an error code
upon failure so don't ignore this in kvm_arch_vcpu_create().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-2-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit 7c67de2
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The kvm_riscv_vcpu_aia_init() does not return any failure so drop
the return value which is always zero.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-3-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit b79bf20
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The kvm_riscv_local_tlb_sanitize() deals with sanitizing current
VMID related TLB mappings when a VCPU is moved from one host CPU
to another.

Let's move kvm_riscv_local_tlb_sanitize() to VMID management
sources and rename it to kvm_riscv_gstage_vmid_sanitize().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-4-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit 7584eb6
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The KVM_REQ_HFENCE_GVMA_VMID_ALL is the same as KVM_REQ_TLB_FLUSH, so
to avoid confusion let's replace KVM_REQ_HFENCE_GVMA_VMID_ALL with
KVM_REQ_TLB_FLUSH. Also, rename kvm_riscv_hfence_gvma_vmid_all_process()
to kvm_riscv_tlb_flush_process().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-5-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit eaa98ba
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The gstage_set_pte() and gstage_op_pte() should flush TLB only when
a leaf PTE changes so that unnecessary TLB flushes can be avoided.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-6-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
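
The flush-on-change rule described in commit eaa98ba can be illustrated with a small userspace sketch. This is not the kernel code: `demo_pte_t`, `demo_tlb_flush()`, and `demo_set_leaf_pte()` are hypothetical stand-ins, and the TLB flush is modelled as a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
typedef uint64_t demo_pte_t;

static unsigned long demo_flush_count;

/* Models a TLB flush by counting how often it would be issued. */
static void demo_tlb_flush(void)
{
	demo_flush_count++;
}

/*
 * Store a leaf PTE and flush only when the stored value really changes.
 * Returns true when a flush was issued.
 */
static bool demo_set_leaf_pte(demo_pte_t *ptep, demo_pte_t new_pte)
{
	assert(ptep != NULL);

	if (*ptep == new_pte)
		return false;	/* unchanged: no flush needed */

	*ptep = new_pte;
	demo_tlb_flush();
	return true;
}
```

Writing the same value twice through `demo_set_leaf_pte()` issues a single flush in this model, which is the saving the commit message describes.
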
mainline inclusion
from mainline-6.17-rc1
commit ca539ba
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The kvm_arch_flush_remote_tlbs_range() expected by KVM core can be
easily implemented for RISC-V using kvm_riscv_hfence_gvma_vmid_gpa()
hence provide it.

Also with kvm_arch_flush_remote_tlbs_range() available for RISC-V, the
mmu_wp_memory_region() can happily use kvm_flush_remote_tlbs_memslot()
instead of kvm_flush_remote_tlbs().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-7-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit 77ba646
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The H-extension CSRs accessed by kvm_riscv_vcpu_trap_redirect() will
trap when KVM RISC-V is running as Guest/VM hence remove these traps
by using ncsr_xyz() instead of csr_xyz().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-8-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit 4ecbd3e
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The MMU, TLB, and VMID management for KVM RISC-V already exists as
separate sources so create separate headers along these lines. This
further simplifies asm/kvm_host.h header.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-9-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit f035b44
category: feature
bugzilla: RVCK-Project#257

--------------------------------

Introduce struct kvm_gstage_mapping which represents a g-stage
mapping at a particular g-stage page table level. Also, update
the kvm_riscv_gstage_map() to return the g-stage mapping upon
success.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-10-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit 4c933f3
category: feature
bugzilla: RVCK-Project#257

--------------------------------

Currently, the struct kvm_riscv_hfence does not have vmid field
and various hfence processing functions always pick vmid assigned
to the guest/VM. This prevents us from doing hfence operation on
arbitrary vmid hence add vmid field to struct kvm_riscv_hfence
and use it wherever applicable.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-11-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit dd82e35
category: feature
bugzilla: RVCK-Project#257

--------------------------------

The upcoming nested virtualization can share g-stage page table
management with the current host g-stage implementation hence
factor-out g-stage page table management as separate sources
and also use "kvm_riscv_mmu_" prefix for host g-stage functions.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-12-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.17-rc1
commit 1f6d0ee
category: feature
bugzilla: RVCK-Project#257

--------------------------------

Currently, all kvm_riscv_hfence_xyz() APIs assume VMID to be the
host VMID of the Guest/VM, which restricts use of these APIs only
for host TLB maintenance. Let's allow passing VMID as a parameter
to all kvm_riscv_hfence_xyz() APIs so that they can be re-used
for nested virtualization related TLB maintenance.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-13-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-6.15-rc1
commit 03dc00a
category: feature
bugzilla: RVCK-Project#257

--------------------------------

Use RSW0 as the special bit for pmds and puds, just like for ptes.
Also define the {pte,pmd,pud}_pgprot helpers which were previously
missing and are needed for the follow_pfnmap APIs.

Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250108135700.2614848-1-abrestic@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>

mainline inclusion
from mainline-7.0-rc1
commit ed7ae7a
category: feature
bugzilla: RVCK-Project#257

--------------------------------

Use block mapping if backed by a THP, as implemented in architectures
like ARM and x86_64.

Signed-off-by: Jessica Liu <liu.xuemei1@zte.com.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20251127165137780QbUOVPKPAfWSGAFl5qtRy@zte.com.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
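
As a rough illustration of when a THP-backed fault (commit ed7ae7a) can be served by a 2 MiB block mapping, the sketch below checks the two conditions that such implementations typically rely on. It is an assumption modelled on the general approach, not the code from the patch: `demo_can_use_block()`, its parameters, and `DEMO_PMD_SIZE` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PMD_SIZE (UINT64_C(2) << 20)	/* 2 MiB block, assumed */

/*
 * Decide whether a fault at fault_hva may be mapped with one 2 MiB block.
 * All names here are illustrative; the memslot is described by its guest
 * physical start, host virtual start, and size in bytes.
 */
static bool demo_can_use_block(uint64_t slot_gpa_start, uint64_t slot_hva_start,
			       uint64_t slot_size, uint64_t fault_hva)
{
	uint64_t mask = DEMO_PMD_SIZE - 1;
	uint64_t slot_hva_end = slot_hva_start + slot_size;
	uint64_t block_start = fault_hva & ~mask;

	assert(slot_hva_end >= slot_hva_start);	/* no address wrap-around */

	/* GPA and HVA must share the same offset inside a 2 MiB block. */
	if ((slot_gpa_start & mask) != (slot_hva_start & mask))
		return false;

	/* The whole block around the fault must lie inside the memslot. */
	return block_start >= slot_hva_start &&
	       block_start + DEMO_PMD_SIZE <= slot_hva_end;
}
```

When the check passes, the mapping address would be aligned down to the block boundary before the block PTE is installed; otherwise the fault falls back to a PAGE_SIZE mapping.
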
@github-actions

github-actions Bot commented May 12, 2026


Test started. Log: https://github.com/RVCK-Project/rvck/actions/runs/25718607314

Parsed arguments

| args | value |
| --- | --- |
| repository | RVCK-Project/rvck |
| head ref | pull/267/head |
| base ref | rvck-6.6 |
| LAVA repo | RVCK-Project/lavaci |
| LAVA hardware | |
| LAVA Testcase path | |
| need run job | kunit-test,kernel-build,check-patch,lava-trigger |

Test finished

Detailed results:

| check | result |
| --- | --- |
| kunit-test | success |
| kernel-build | failure |
| check-patch | failure |
| lava-trigger-qemu | skipped |
| lava-trigger-sg2042 | skipped |
| lava-trigger-k1 | skipped |
| lava-trigger-lpi4a | skipped |

Kunit Test Result

[07:02:17] Testing complete. Ran 457 tests: passed: 445, skipped: 12

Kernel Build Result

Check Patch Result

Total Errors 1
Total Warnings 3

yechao-w and others added 4 commits May 12, 2026 15:17
mainline inclusion
from mainline-7.0-rc4
commit b342166
category: feature
bugzilla: RVCK-Project#257

--------------------------------

When dirty logging is enabled, guest stage mappings are forced to
PAGE_SIZE granularity. Changing the mapping page size at this point
is incorrect.

Fixes: ed7ae7a ("RISC-V: KVM: Transparent huge page support")
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260226191231140_X1Juus7s2kgVlc0ZyW_K@zte.com.cn
Signed-off-by: Anup Patel <anup@brainfault.org>

mainline inclusion
from mainline-7.1-rc1
commit a216e24
category: feature
bugzilla: RVCK-Project#257

--------------------------------

When enabling dirty log in small chunks (e.g., QEMU default chunk
size of 256K), the chunk size is always smaller than the page size
of huge pages (1G or 2M) used in the gstage page tables. This caused
the write protection to be incorrectly skipped for huge PTEs because
the condition `(end - addr) >= page_size` was not satisfied.

Remove the size check in `kvm_riscv_gstage_wp_range()` to ensure huge
PTEs are always write-protected regardless of the chunk size. Additionally,
explicitly align the address down to the page size before invoking
`kvm_riscv_gstage_op_pte()` to guarantee that the address passed to the
operation function is page-aligned.

This fixes the issue where dirty pages might not be tracked correctly
when using huge pages.

Fixes: 9d05c1f ("RISC-V: KVM: Implement stage2 page table programming")
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/202603301610527120YZ-pAJY6x9SBpSRo1Wg4@zte.com.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
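
The effect of dropping the size check in commit a216e24 can be reproduced with a small userspace model. This is a sketch under stated assumptions, not the kernel function: `struct demo_leaf`, `demo_get_leaf()`, and `demo_wp_range()` are hypothetical, the page table is reduced to a flat array of leaf mappings, and the `legacy_size_check` flag only exists to contrast the old condition `(end - addr) >= page_size` with the fixed behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE UINT64_C(0x1000)		/* 4 KiB */
#define DEMO_HUGE_SIZE UINT64_C(0x200000)	/* 2 MiB */

/* Hypothetical leaf mapping: covers [base, base + size). */
struct demo_leaf {
	uint64_t base;
	uint64_t size;
	bool writable;
};

/* Return the leaf covering addr, or NULL when nothing is mapped there. */
static struct demo_leaf *demo_get_leaf(struct demo_leaf *leaves, size_t n,
				       uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= leaves[i].base && addr - leaves[i].base < leaves[i].size)
			return &leaves[i];
	return NULL;
}

/* Write-protect every leaf that overlaps [start, start + size). */
static void demo_wp_range(struct demo_leaf *leaves, size_t n, uint64_t start,
			  uint64_t size, bool legacy_size_check)
{
	uint64_t addr = start, end = start + size;

	while (addr < end) {
		struct demo_leaf *leaf = demo_get_leaf(leaves, n, addr);
		uint64_t page_size = leaf ? leaf->size : DEMO_PAGE_SIZE;

		assert((page_size & (page_size - 1)) == 0);	/* power of two */

		/* Old behaviour skipped leaves larger than the remaining chunk. */
		if (leaf && (!legacy_size_check || end - addr >= page_size))
			leaf->writable = false;

		/* Align down to the leaf boundary, then step over the leaf. */
		addr = (addr & ~(page_size - 1)) + page_size;
	}
}
```

With a single 2 MiB leaf and a 256K chunk, the legacy check leaves the leaf writable, while the fixed path write-protects it, which matches the lost dirty tracking described in the commit message.
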
mainline inclusion
from mainline-7.1-rc1
commit 6ad36f3
category: feature
bugzilla: RVCK-Project#257

--------------------------------

During dirty logging, all huge pages are write-protected. When the guest
writes to a write-protected huge page, a page fault is triggered. Before
recovering the write permission, the huge page must be split into smaller
pages (e.g., 4K). After splitting, the normal mapping process proceeds,
allowing write permission to be restored at the smaller page granularity.

If dirty logging is disabled because migration failed or was cancelled,
only recover the write permission at the 4K level, and skip recovering the
huge page mapping at this time to avoid the overhead of freeing page tables.
The huge page mapping can be recovered in the ioctl context, similar to x86,
in a later patch.

Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/202603301612587174XZ6QMCrymBqv30S6BN50@zte.com.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
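
The split step from commit 6ad36f3 can be sketched as follows. This is a minimal userspace model, assuming a 2 MiB leaf is replaced by a table of 512 4 KiB leaves that inherit its permission bits. The PTE layout used (flag bits 0 to 9, PPN starting at bit 10) follows the RISC-V privileged specification, but `demo_split_leaf()` and the `DEMO_*` constants are hypothetical, and any attribute bits above the PPN field are assumed to be zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PTRS_PER_TABLE 512
#define DEMO_PFN_SHIFT      10
#define DEMO_PROT_MASK      ((UINT64_C(1) << DEMO_PFN_SHIFT) - 1)

/*
 * Expand one 2 MiB leaf PTE into 512 4 KiB leaf PTEs.
 * Each child keeps the parent's flag bits (including a cleared W bit)
 * and points at the matching 4 KiB frame inside the huge page.
 */
static void demo_split_leaf(uint64_t huge_pte,
			    uint64_t child[DEMO_PTRS_PER_TABLE])
{
	uint64_t prot = huge_pte & DEMO_PROT_MASK;
	uint64_t pfn = huge_pte >> DEMO_PFN_SHIFT;

	/* A 2 MiB leaf must start on a 512-frame boundary. */
	assert(pfn % DEMO_PTRS_PER_TABLE == 0);

	for (size_t i = 0; i < DEMO_PTRS_PER_TABLE; i++)
		child[i] = ((pfn + i) << DEMO_PFN_SHIFT) | prot;
}
```

After such a split, the fault handler only needs to restore write permission on the single 4 KiB entry that the guest touched, so dirty tracking stays at PAGE_SIZE granularity.
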
mainline inclusion
from mainline-7.0-rc4
commit dec9ed9
category: feature
bugzilla: RVCK-Project#257

--------------------------------

While fuzzing KVM on RISC-V, a use-after-free was observed in
kvm_riscv_gstage_get_leaf(), where ptep_get() dereferences a
freed gstage page table page during gfn unmap.

The crash manifests as:
  use-after-free in ptep_get include/linux/pgtable.h:340 [inline]
  use-after-free in kvm_riscv_gstage_get_leaf arch/riscv/kvm/gstage.c:89
  Call Trace:
    ptep_get include/linux/pgtable.h:340 [inline]
    kvm_riscv_gstage_get_leaf+0x2ea/0x358 arch/riscv/kvm/gstage.c:89
    kvm_riscv_gstage_unmap_range+0xf0/0x308 arch/riscv/kvm/gstage.c:265
    kvm_unmap_gfn_range+0x168/0x1fc arch/riscv/kvm/mmu.c:256
    kvm_mmu_unmap_gfn_range virt/kvm/kvm_main.c:724 [inline]
  page last free pid 808 tgid 808 stack trace:
    kvm_riscv_mmu_free_pgd+0x1b6/0x26a arch/riscv/kvm/mmu.c:457
    kvm_arch_flush_shadow_all+0x1a/0x24 arch/riscv/kvm/mmu.c:134
    kvm_flush_shadow_all virt/kvm/kvm_main.c:344 [inline]

The UAF is caused by gstage page table walks running concurrently with
gstage pgd teardown. In particular, kvm_unmap_gfn_range() can traverse
gstage page tables while kvm_arch_flush_shadow_all() frees the pgd,
leading to use-after-free of page table pages.

Fix the issue by serializing gstage unmap and pgd teardown with
kvm->mmu_lock. Holding mmu_lock ensures that gstage page tables
remain valid for the duration of unmap operations and prevents
concurrent frees.

This matches existing RISC-V KVM usage of mmu_lock to protect gstage
map/unmap operations, e.g. kvm_riscv_mmu_iounmap.

Fixes: dd82e35 ("RISC-V: KVM: Factor-out g-stage page table management")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260202040059.1801167-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
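
The serialization idea behind commit dec9ed9 can be modelled in userspace with a mutex standing in for kvm->mmu_lock. This is only a sketch of the locking pattern, not the kernel fix: `struct demo_vm`, `demo_unmap_range()`, and `demo_free_pgd()` are hypothetical, and the page table is reduced to a flat array.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_ENTRIES 512

/* Hypothetical VM state: one lock guarding one page-table page. */
struct demo_vm {
	pthread_mutex_t mmu_lock;
	unsigned long *pgd;
};

/* Clear entries under the lock; returns false if the pgd is already gone. */
static bool demo_unmap_range(struct demo_vm *vm, size_t first, size_t count)
{
	bool walked = false;

	assert(vm != NULL);
	pthread_mutex_lock(&vm->mmu_lock);
	if (vm->pgd) {
		for (size_t i = first; i < first + count && i < DEMO_ENTRIES; i++)
			vm->pgd[i] = 0;
		walked = true;
	}
	pthread_mutex_unlock(&vm->mmu_lock);
	return walked;
}

/* Detach the pgd under the same lock, then free it. */
static void demo_free_pgd(struct demo_vm *vm)
{
	unsigned long *pgd;

	assert(vm != NULL);
	pthread_mutex_lock(&vm->mmu_lock);
	pgd = vm->pgd;
	vm->pgd = NULL;
	pthread_mutex_unlock(&vm->mmu_lock);
	free(pgd);
}
```

Because both paths take the same lock, a walker either finishes before the table is detached or observes a NULL pgd afterwards. In this model it can never dereference a freed page.
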
@github-actions

github-actions Bot commented May 12, 2026


Test started. Log: https://github.com/RVCK-Project/rvck/actions/runs/25719749455

Parsed arguments

| args | value |
| --- | --- |
| repository | RVCK-Project/rvck |
| head ref | pull/267/head |
| base ref | rvck-6.6 |
| LAVA repo | RVCK-Project/lavaci |
| LAVA hardware | |
| LAVA Testcase path | |
| need run job | kunit-test,kernel-build,check-patch,lava-trigger |

Test finished

Detailed results:

| check | result |
| --- | --- |
| kunit-test | success |
| kernel-build | success |
| check-patch | success |
| lava-trigger-qemu | skipped |
| lava-trigger-sg2042 | skipped |
| lava-trigger-k1 | skipped |
| lava-trigger-lpi4a | skipped |

Kunit Test Result

[07:25:58] Testing complete. Ran 457 tests: passed: 445, skipped: 12

Kernel Build Result

Check Patch Result

Total Errors 0
Total Warnings 2
