[Backport] KVM support THP and huge pages spliting for dirty logging #267
Open
yechao-w wants to merge 18 commits into
Open
[Backport] KVM support THP and huge pages spliting for dirty logging #267yechao-w wants to merge 18 commits into
yechao-w wants to merge 18 commits into
Conversation
mainline inclusion from mainline-6.17-rc1 commit 4a50578 category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_riscv_vcpu_alloc_vector_context() does return an error code upon failure so don't ignore this in kvm_arch_vcpu_create(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-2-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 7c67de2 category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_riscv_vcpu_aia_init() does not return any failure so drop the return value which is always zero. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-3-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit b79bf20 category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_riscv_local_tlb_sanitize() deals with sanitizing current VMID related TLB mappings when a VCPU is moved from one host CPU to another. Let's move kvm_riscv_local_tlb_sanitize() to VMID management sources and rename it to kvm_riscv_gstage_vmid_sanitize(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-4-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 7584eb6 category: feature bugzilla: RVCK-Project#257 -------------------------------- The KVM_REQ_HFENCE_GVMA_VMID_ALL is same as KVM_REQ_TLB_FLUSH so to avoid confusion let's replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH. Also, rename kvm_riscv_hfence_gvma_vmid_all_process() to kvm_riscv_tlb_flush_process(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-5-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit eaa98ba category: feature bugzilla: RVCK-Project#257 -------------------------------- The gstage_set_pte() and gstage_op_pte() should flush TLB only when a leaf PTE changes so that unnecessary TLB flushes can be avoided. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-6-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit ca539ba category: feature bugzilla: RVCK-Project#257 -------------------------------- The kvm_arch_flush_remote_tlbs_range() expected by KVM core can be easily implemented for RISC-V using kvm_riscv_hfence_gvma_vmid_gpa() hence provide it. Also with kvm_arch_flush_remote_tlbs_range() available for RISC-V, the mmu_wp_memory_region() can happily use kvm_flush_remote_tlbs_memslot() instead of kvm_flush_remote_tlbs(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-7-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 77ba646 category: feature bugzilla: RVCK-Project#257 -------------------------------- The H-extension CSRs accessed by kvm_riscv_vcpu_trap_redirect() will trap when KVM RISC-V is running as Guest/VM hence remove these traps by using ncsr_xyz() instead of csr_xyz(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-8-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 4ecbd3e category: feature bugzilla: RVCK-Project#257 -------------------------------- The MMU, TLB, and VMID management for KVM RISC-V already exists as seprate sources so create separate headers along these lines. This further simplifies asm/kvm_host.h header. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-9-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit f035b44 category: feature bugzilla: RVCK-Project#257 -------------------------------- Introduce struct kvm_gstage_mapping which represents a g-stage mapping at a particular g-stage page table level. Also, update the kvm_riscv_gstage_map() to return the g-stage mapping upon success. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-10-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 4c933f3 category: feature bugzilla: RVCK-Project#257 -------------------------------- Currently, the struct kvm_riscv_hfence does not have vmid field and various hfence processing functions always pick vmid assigned to the guest/VM. This prevents us from doing hfence operation on arbitrary vmid hence add vmid field to struct kvm_riscv_hfence and use it wherever applicable. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-11-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit dd82e35 category: feature bugzilla: RVCK-Project#257 -------------------------------- The upcoming nested virtualization can share g-stage page table management with the current host g-stage implementation hence factor-out g-stage page table management as separate sources and also use "kvm_riscv_mmu_" prefix for host g-stage functions. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-12-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.17-rc1 commit 1f6d0ee category: feature bugzilla: RVCK-Project#257 -------------------------------- Currently, all kvm_riscv_hfence_xyz() APIs assume VMID to be the host VMID of the Guest/VM which resticts use of these APIs only for host TLB maintenance. Let's allow passing VMID as a parameter to all kvm_riscv_hfence_xyz() APIs so that they can be re-used for nested virtualization related TLB maintenance. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com> Link: https://lore.kernel.org/r/20250618113532.471448-13-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-6.15-rc1 commit 03dc00a category: feature bugzilla: RVCK-Project#257 -------------------------------- Use RSW0 as the special bit for pmds and puds, just like for ptes. Also define the {pte,pmd,pud}_pgprot helpers which were previously missing and are needed for the follow_pfnmap APIs. Signed-off-by: Andrew Bresticker <abrestic@rivosinc.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20250108135700.2614848-1-abrestic@rivosinc.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
mainline inclusion from mainline-7.0-rc1 commit ed7ae7a category: feature bugzilla: RVCK-Project#257 -------------------------------- Use block mapping if backed by a THP, as implemented in architectures like ARM and x86_64. Signed-off-by: Jessica Liu <liu.xuemei1@zte.com.cn> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20251127165137780QbUOVPKPAfWSGAFl5qtRy@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
|
开始测试 log: https://github.com/RVCK-Project/rvck/actions/runs/25718607314 参数解析结果
测试完成 详细结果:
Kunit Test Result[07:02:17] Testing complete. Ran 457 tests: passed: 445, skipped: 12
Kernel Build Result
Check Patch Result
|
mainline inclusion from mainline-v7.0-rc4 commit b342166 category: feature bugzilla: RVCK-Project#257 -------------------------------- When dirty logging is enabled, guest stage mappings are forced to PAGE_SIZE granularity. Changing the mapping page size at this point is incorrect. Fixes: ed7ae7a ("RISC-V: KVM: Transparent huge page support") Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260226191231140_X1Juus7s2kgVlc0ZyW_K@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org>
…ging mainline inclusion from mainline-7.1-rc1 commit a216e24 category: feature bugzilla: RVCK-Project#257 -------------------------------- When enabling dirty log in small chunks (e.g., QEMU default chunk size of 256K), the chunk size is always smaller than the page size of huge pages (1G or 2M) used in the gstage page tables. This caused the write protection to be incorrectly skipped for huge PTEs because the condition `(end - addr) >= page_size` was not satisfied. Remove the size check in `kvm_riscv_gstage_wp_range()` to ensure huge PTEs are always write-protected regardless of the chunk size. Additionally, explicitly align the address down to the page size before invoking `kvm_riscv_gstage_op_pte()` to guarantee that the address passed to the operation function is page-aligned. This fixes the issue where dirty pages might not be tracked correctly when using huge pages. Fixes: 9d05c1f ("RISC-V: KVM: Implement stage2 page table programming") Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn> Reviewed-by: Nutty Liu <nutty.liu@hotmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/202603301610527120YZ-pAJY6x9SBpSRo1Wg4@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org>
mainline inclusion from mainline-7.1-rc1 commit 6ad36f3 category: feature bugzilla: RVCK-Project#257 -------------------------------- During dirty logging, all huge pages are write-protected. When the guest writes to a write-protected huge page, a page fault is triggered. Before recovering the write permission, the huge page must be split into smaller pages (e.g., 4K). After splitting, the normal mapping process proceeds, allowing write permission to be restored at the smaller page granularity. If dirty logging is disabled because migration failed or was cancelled, only recover the write permission at the 4K level, and skip recovering the huge page mapping at this time to avoid the overhead of freeing page tables. The huge page mapping can be recovered in the ioctl context, similar to x86, in a later patch. Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/202603301612587174XZ6QMCrymBqv30S6BN50@zte.com.cn Signed-off-by: Anup Patel <anup@brainfault.org>
mainline inclusion from mainline-7.0-rc4 commit dec9ed9 category: feature bugzilla: RVCK-Project#257 -------------------------------- While fuzzing KVM on RISC-V, a use-after-free was observed in kvm_riscv_gstage_get_leaf(), where ptep_get() dereferences a freed gstage page table page during gfn unmap. The crash manifests as: use-after-free in ptep_get include/linux/pgtable.h:340 [inline] use-after-free in kvm_riscv_gstage_get_leaf arch/riscv/kvm/gstage.c:89 Call Trace: ptep_get include/linux/pgtable.h:340 [inline] kvm_riscv_gstage_get_leaf+0x2ea/0x358 arch/riscv/kvm/gstage.c:89 kvm_riscv_gstage_unmap_range+0xf0/0x308 arch/riscv/kvm/gstage.c:265 kvm_unmap_gfn_range+0x168/0x1fc arch/riscv/kvm/mmu.c:256 kvm_mmu_unmap_gfn_range virt/kvm/kvm_main.c:724 [inline] page last free pid 808 tgid 808 stack trace: kvm_riscv_mmu_free_pgd+0x1b6/0x26a arch/riscv/kvm/mmu.c:457 kvm_arch_flush_shadow_all+0x1a/0x24 arch/riscv/kvm/mmu.c:134 kvm_flush_shadow_all virt/kvm/kvm_main.c:344 [inline] The UAF is caused by gstage page table walks running concurrently with gstage pgd teardown. In particular, kvm_unmap_gfn_range() can traverse gstage page tables while kvm_arch_flush_shadow_all() frees the pgd, leading to use-after-free of page table pages. Fix the issue by serializing gstage unmap and pgd teardown with kvm->mmu_lock. Holding mmu_lock ensures that gstage page tables remain valid for the duration of unmap operations and prevents concurrent frees. This matches existing RISC-V KVM usage of mmu_lock to protect gstage map/unmap operations, e.g. kvm_riscv_mmu_iounmap. Fixes: dd82e35 ("RISC-V: KVM: Factor-out g-stage page table management") Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn> Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20260202040059.1801167-1-xujiakai2025@iscas.ac.cn Signed-off-by: Anup Patel <anup@brainfault.org> Signed-off-by: Wang Yechao <wang.yechao255@zte.com.cn>
|
开始测试 log: https://github.com/RVCK-Project/rvck/actions/runs/25719749455 参数解析结果
测试完成 详细结果:
Kunit Test Result[07:25:58] Testing complete. Ran 457 tests: passed: 445, skipped: 12
Kernel Build Result
Check Patch Result
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
issues: #257
一、功能介绍
所有补丁来自L0,KVM支持如下两个功能:
1)Gstage支持THP,提升虚拟机内存访问性能;
2)Gstage支持巨页拆分,支持巨页场景(HugeTlb,THP)虚拟机迁移;
二、补丁集合
三、测试验证
1)kvm-unit-tests测试验证
2)巨页虚拟机迁移测试验证
qemu TCG作为risc-v主机,启动以及迁移kvm虚拟机正常。迁移命令:
kvm虚拟机使用的qemu版本:
社区qemu-10.0版本叠加如下补丁:
https://lore.kernel.org/qemu-devel/20260305171958533kl0ISsA1hGkh97tu5LUWy@zte.com.cn/
https://lore.kernel.org/qemu-devel/20250915070811.3422578-2-xb@ultrarisc.com/