[WIP] Dynamic resizing by jiashuy · Pull Request #285 · NVIDIA/recsys-examples

jiashuy · 2026-01-27T02:35:43Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

greptile-apps · 2026-01-27T02:39:03Z

Greptile Summary

This PR implements dynamic resizing for embedding tables with significant architectural changes:

Key Changes:

Refactored from dual-table architecture (dev_table/uvm_table split) to unified single-table design (self._table)
Implemented VMM (Virtual Memory Management) based extendable buffers for both device and host memory
Added NO_EVICTION mode that triggers automatic rehashing when capacity is exceeded
Simplified CUDA kernels by removing split_index logic throughout optimizer and table operations
All optimizer functions renamed from *_for_combined_table to *_for_table

Critical Issue Found:

Line 1317 in key_value_table.py contains a typo: torch.arrange should be torch.arange. This will cause immediate runtime failure when NO_EVICTION mode generates scores.

Architecture Improvements:

VMM implementation allows dynamic memory extension while preserving base addresses
Cleaner abstraction with ExtendableBuffer interface
Reduced code complexity by unifying memory handling paths

Concerns:

PR is marked as WIP with incomplete checklist (no tests confirmation, no documentation updates)
The rehashing logic in NO_EVICTION mode could lead to unbounded recursion if memory allocation fails repeatedly
Multiple print statements for debugging suggest production-readiness concerns

Confidence Score: 1/5

This PR contains a critical syntax error that will cause runtime failures
Score of 1 reflects the presence of a critical bug (torch.arrange typo) that will immediately break NO_EVICTION mode functionality. While the architectural refactoring appears sound, the syntax error must be fixed before merge.
Pay close attention to corelib/dynamicemb/dynamicemb/key_value_table.py which contains the critical typo on line 1317, and verify the NO_EVICTION mode logic thoroughly once fixed.

Important Files Changed

Filename	Overview
corelib/dynamicemb/dynamicemb/key_value_table.py	Major refactoring from dual-table (dev/uvm) to single table architecture. Contains critical typo: `torch.arrange` instead of `torch.arange` on line 1317 that will cause runtime errors.
corelib/dynamicemb/src/vmm_tensor.cu	New file implementing VMM-based memory management for device and host memory. Includes proper error handling and memory cleanup in destructors.
corelib/dynamicemb/src/dynamic_emb_op.cu	Simplified from combined table (dev/uvm split) to single table operations. Removed split_index logic and consolidated load/store operations.
corelib/dynamicemb/src/optimizer.cu	Refactored optimizer functions from `_for_combined_table` to `_for_table`, removing dev/uvm split logic. Cleaner single-table interface.
corelib/dynamicemb/dynamicemb/batched_dynamicemb_tables.py	Added NO_EVICTION score strategy support and improved cache decision logic with better HBM budget handling.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Table as DynamicEmbeddingTable
    participant KVMap as KeyIndexMap (Hash Table)
    participant VMM as VMMTensor/HostVMMTensor
    participant Buffer as ExtendableBuffer
    participant CUDA as CUDA/CU Driver API

    Note over App,CUDA: Initialization Flow
    App->>Table: Create table with capacity
    Table->>Buffer: Create VmmDeviceBuffer/RegisteredHostBuffer
    Buffer->>VMM: Initialize VMMTensor(capacity, dtype, device)
    VMM->>CUDA: cuMemAddressReserve (reserve VA space)
    VMM->>CUDA: cuMemCreate + cuMemMap (allocate initial pages)
    VMM->>CUDA: cuMemSetAccess (set permissions)
    VMM-->>Buffer: Return tensor wrapper
    Table->>KVMap: Create hash table with capacity
    
    Note over App,CUDA: Insert Flow (NO_EVICTION Mode)
    App->>Table: insert(keys, values)
    Table->>Table: Generate scores = arange(valid_rows, valid_rows+n)
    Table->>KVMap: insert_and_evict(keys, scores)
    alt Hash table full (eviction occurs)
        KVMap-->>Table: Return evicted_keys, evicted_scores
        Table->>Table: load_from_table(evicted_scores, evicted_values)
        Table->>Table: rehash(capacity * 2)
        Table->>KVMap: Create new hash table (2x capacity)
        Table->>Table: Re-insert all entries to new hash table
        Table->>Buffer: extend(capacity * 2)
        Buffer->>VMM: extend(new_capacity)
        VMM->>CUDA: cuMemCreate + cuMemMap (map new pages)
        VMM->>CUDA: cuMemSetAccess (set new page permissions)
        Table->>Table: store_to_table(scores, values)
        Table->>Table: Recursive insert(evicted_keys, evicted_values)
    else No eviction
        Table->>Table: store_to_table(scores, values)
    end
    
    Note over App,CUDA: Lookup and Update Flow
    App->>Table: update(keys, grads)
    Table->>KVMap: lookup(keys) → indices
    Table->>CUDA: optimizer_update_kernel(grads, table, indices)
    CUDA->>VMM: Read/write embedding table via mapped memory
    
    Note over App,CUDA: Memory Extension
    Note right of VMM: All extensions preserve base address
    VMM->>CUDA: Map additional pages at offset
    CUDA-->>VMM: Extended virtual address range

greptile-apps

_{19 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-27T02:39:02Z

corelib/dynamicemb/dynamicemb/key_value_table.py

            policy = ScorePolicy.ASSIGN

+        if self._no_eviction and scores is None:
+            scores = torch.arrange(


syntax: torch.arrange is not a valid PyTorch function, should be torch.arange

Suggested change

scores = torch.arrange(

scores = torch.arange(

jiashuy added 4 commits January 27, 2026 02:34

Remove supports to concated table.

f173c83

dynamic resizing

0c62b54

Support VMM Extendable buffer

85f5fbc

Support host registered pinned memory as extendable buffer

c0e6409

jiashuy added the dynamicemb Related with dynamicemb label Jan 27, 2026

greptile-apps bot reviewed Jan 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Dynamic resizing#285

[WIP] Dynamic resizing#285
jiashuy wants to merge 4 commits intoNVIDIA:mainfrom
jiashuy:dynamic_resizing

jiashuy commented Jan 27, 2026

Uh oh!

greptile-apps bot commented Jan 27, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Jan 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jiashuy commented Jan 27, 2026

Description

Checklist

Uh oh!

greptile-apps bot commented Jan 27, 2026

Greptile Summary

Confidence Score: 1/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant