diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 00000000..2597b28a
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,141 @@
+# CLAUDE.md: PyTorchSim quick reference
+
+Reference notes for working in this repo. The canonical user-facing docs live in `README.md`; this file is a short, opinionated map for development sessions.
+
+## What this repo is
+
+PyTorchSim is a cycle-accurate NPU simulation framework. It plugs into the PyTorch 2 `torch.compile` stack via a custom `npu:0` device (PrivateUse1 backend) and runs three coupled simulators per compiled kernel:
+
+1. **Gem5** (RISC-V): produces compute-latency tables for the TOG
+2. **Spike**: functional simulator that validates generated code correctness
+3. **TOGSim**: the project's own cycle-accurate Tile-Operation-Graph simulator that models DRAM (Ramulator2), NoC (BookSim2), L2, systolic arrays, VPU lanes
+
+The pipeline runs in that order on every `torch.compile` invocation; you'll see the three banners (`[Gem5]`, `[Spike]`, `[TOGSim]`) in the log when everything is working.
+
+## Repo layout (the parts that actually matter)
+
+| Path | Purpose |
+|---|---|
+| `PyTorchSimFrontend/` | Python compiler stack (Inductor backend). `extension_config.py` is the central settings reader; `mlir/` contains MLIR templates per op (gemm, conv, bmm, sdpa, sort, cat, maxpool, …) |
+| `PyTorchSimDevice/` | C++ PyTorch backend registering the `npu` device. Built as a pip-installed package via `setup.py`. Based on `torch_openreg` (PrivateUse1 example). Produces `_C.cpython-*.so` |
+| `Simulator/simulator.py` | Python drivers: `FunctionalSimulator` (Spike), `CycleSimulator` (Gem5), `TOGSimulator` (the cycle-accurate one + multi-tenant context manager) |
+| `Scheduler/scheduler.py` | Poisson arrival generator + scheduling utilities for multi-tenant runs |
+| `TOGSim/` | C++ TOGSim source. `src/Simulator.cc`, `Core.cc`, `Dram.cc`, `Interconnect.cc`, `L2Cache.cc`, `Tile.cc`, `TileGraph.cc` are the core models. Externals: ramulator2, booksim, stonneCore, onnx, protobuf, spdlog, yaml-cpp |
+| `AsmParser/` | `tog_generator.py`, `onnx_utility.py`: TOG generation from ONNX/ASM |
+| `configs/` | TOGSim hardware configs (YAML). The default is `systolic_ws_128x128_c1_simple_noc_tpuv3.yml`. Naming pattern: `systolic_ws__c__.yml` |
+| `tests/` | ~36 op- and model-level tests. Subdirs `DeepSeek/`, `Diffusion/`, `Llama/`, `MLP/`, `Mixtral_8x7B/`, `MoE/`, `Yolov5/`, `Fusion/` for whole-model workloads |
+| `experiments/artifact/` | Paper reproduction scripts (`cycle_validation/run_cycle.sh`, `speedup/run_speedup.sh`) |
+| `scripts/` | One-off experiment runners (CompilerOpt, ILS, batch, chiplet, sparsity, stonne, end2end). `build_from_source.sh` builds gem5/llvm/spike |
+| `gem5_script/` | gem5 wrapper scripts called by `CycleSimulator` |
+| `tpuv4/` | Example SRAM/L2 buffer plans for TPUv4-style persistent cache |
+| `togsim_results/` | TOGSim log + trace dump directory (per-run) |
+| `outputs/` | Per-run hashed output dirs |
+
+## Running tests
+
+Most tests follow the same pattern: build a CPU reference, compile via `torch.compile` on `npu:0`, and compare with `torch.allclose` (rtol=atol=1e-4). They all have `if __name__ == "__main__"` blocks.
+
+```bash
+python tests/test_add.py  # vector add (smoke test, fastest)
+python tests/test_matmul.py  # GEMM
+python tests/test_mlp.py  # MLP forward + backward (training path)
+python tests/test_scheduler.py  # multi-tenant launch_model
+python tests/test_eager.py  # eager-fallback registration
+```
+
+Run a model from `tests/Llama/`, `tests/DeepSeek/`, etc. similarly.
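To make the `rtol=atol=1e-4` tolerance concrete: `torch.allclose(input, other)` accepts an element when `|input - other| <= atol + rtol * |other|`. The following is a plain-Python sketch of that criterion only. The helper name `within_tolerance` is hypothetical, it ignores NaN/inf handling, and treating the CPU reference as `other` is an assumption about how the tests order their arguments.

```python
def within_tolerance(actual, expected, rtol=1e-4, atol=1e-4):
    """Elementwise check mirroring the allclose rule:
    |actual - expected| <= atol + rtol * |expected|.

    Hypothetical helper for illustration; `expected` plays the role of `other`.
    """
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# A difference of 5e-5 around 1.0 is inside the ~2e-4 budget.
print(within_tolerance([1.00005], [1.0]))
# A difference of 1e-3 around 1.0 is outside it.
print(within_tolerance([1.001], [1.0]))
```

Because the budget has an absolute term, values near zero are still allowed an error of up to `atol`, while large values get a budget that scales with their magnitude.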
+
+**For fast iteration** (skip functional check):
+```bash
+export pytorchsim_functional_mode=False  # skips Spike
+```
+
+**To dump intermediate IR while debugging:**
+```bash
+export TORCHSIM_DUMP_MLIR_IR=1
+export TORCHSIM_DUMP_LLVM_IR=1
+```
+
+## Key environment variables
+
+Read in `PyTorchSimFrontend/extension_config.py`:
+
+| Var | Default | Purpose |
+|---|---|---|
+| `TORCHSIM_DIR` | `/workspace/PyTorchSim` | repo root |
+| `TOGSIM_CONFIG` | `configs/systolic_ws_128x128_c1_simple_noc_tpuv3.yml` | TOGSim hardware YAML |
+| `GEM5_PATH` | `/workspace/gem5/build/RISCV/gem5.opt` | gem5 binary |
+| `TORCHSIM_LLVM_PATH` | `/usr/bin` | LLVM tool dir |
+| `TORCHSIM_LOG_PATH` | `$TORCHSIM_DIR/togsim_results` | where TOGSim logs go |
+| `TORCHSIM_DUMP_PATH` | `$TORCHSIM_DIR` | misc dumps |
+| `TORCHSIM_TLS_MODE` | `1` | TLS vs ILS mode |
+| `TORCHSIM_USE_TIMING_POOLING` | `0` | lightweight pooling timing |
+| `TORCHSIM_DEBUG_MODE` | `0` | extra debug |
+| `TORCHSIM_DUMP_MLIR_IR` | `0` | dump MLIR |
+| `TORCHSIM_DUMP_LLVM_IR` | `0` | dump LLVM IR |
+| `SRAM_BUFFER_PLAN_PATH` | unset | L2/CMEM persistent-cache tensor plan (Python file with `plan = {...}`) |
+| `TOGSIM_DEBUG_LEVEL` | unset | passed to TOGSim `--log_level` |
+
+Note: `TOGSIM_CONFIG` is **overwritten** while inside a `with TOGSimulator(config_path=...)` block (and restored on exit). That way, compilation reads the same YAML as TOGSim.
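The override-and-restore behavior in the note can be pictured with a small context manager. This is a minimal sketch of the general pattern, not the actual `TOGSimulator` implementation; the helper name `override_env` and the path `configs/example.yml` are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def override_env(name, value):
    """Temporarily set os.environ[name], restoring the previous state on exit.

    Hypothetical helper: illustrates the pattern only.
    """
    missing = object()  # sentinel: distinguishes "unset" from an empty string
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore even if the body raised.
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Inside the block the variable points at the override; afterwards it is
# back to whatever it was before (or unset again).
with override_env("TOGSIM_CONFIG", "configs/example.yml"):  # hypothetical path
    print("inside:", os.environ["TOGSIM_CONFIG"])
print("after:", os.environ.get("TOGSIM_CONFIG"))
```

Restoring in a `finally` clause matters here: if compilation or simulation raises inside the block, later code still sees the original setting.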
+
+## TOGSim YAML knobs (the ones I edit most)
+
+Located under `configs/*.yml`:
+
+- `num_cores`, `core_freq_mhz`, `num_systolic_array_per_core`
+- `vpu_num_lanes`, `vpu_spad_size_kb_per_lane`, `vpu_vector_length_bits`
+- `dram_type` (`ramulator2` | `simple`), `dram_channels`, `dram_freq_mhz`, `ramulator_config_path`
+- `icnt_type` (`simple` | `booksim`), `icnt_latency_cycles`, `icnt_freq_mhz`, `icnt_config_path`
+- `l2d_type` (e.g., `datacache`), `l2d_config` (AccelSim-format cache config string)
+- `pytorchsim_functional_mode` (Spike on/off), `pytorchsim_timing_mode`
+- `codegen_mapping_strategy`: `heuristic` | `autotune` | `external-then-heuristic` | `external-then-autotune`
+- `codegen_external_mapping_file` (key `"M_N_K"` → `{TILE_M, TILE_K, TILE_N}` JSON)
+- `codegen_compiler_optimization`: `"all"` | `"none"` | a list from `{fusion, reduction_epilogue, reduction_reduction, prologue, single_batch_conv, multi_tile_conv, subtile}`
+- `num_partition` + `partition: {core_0: 0, core_1: 1}` for multi-tenant `stream_index` mapping
+
+## Multi-tenant API (Simulator/simulator.py + scheduler)
+
+```python
+from Simulator.simulator import TOGSimulator
+from Scheduler.scheduler import poisson_request_generator
+
+with TOGSimulator(config_path=...):
+    torch.npu.launch_model(opt_model, x, stream_index=0, timestamp=0)  # timestamp in ns
+    torch.npu.synchronize()  # barrier
+```
+
+`stream_index` must be a valid queue id from the YAML's `partition` map. `timestamp` is in nanoseconds; multiply Poisson arrival times given in milliseconds by 1e6.
+
+## Build
+
+- **Docker (recommended):** `docker run -it --ipc=host --name torchsim -w /workspace/PyTorchSim ghcr.io/psal-postech/torchsim-ci:v1.0.1 bash`
+- **TOGSim from source:** `cd TOGSim && mkdir -p build && cd build && conan install .. --build=missing && cmake .. && make -j$(nproc)`
+- **PyTorchSimDevice (Python package):** `cd PyTorchSimDevice && python -m pip install --no-build-isolation -e .`
+- **gem5 / LLVM+MLIR / Spike from source:** `bash scripts/build_from_source.sh` (clones to `/workspace/{gem5,llvm-project,riscv-isa-sim}`)
+
+Conan deps for TOGSim: `boost/1.79.0`, `robin-hood-hashing/3.11.5`, `spdlog/1.11.0`, `yaml-cpp/0.8.0`.
+
+## Where to look for X
+
+- **Adding a new op (Inductor lowering):** `PyTorchSimFrontend/mlir/mlir_ops.py`, `mlir_lowering.py`, plus a new `mlir__template.py` if it needs its own MLIR template. Decomposition rules: `mlir_decomposition.py`. Scheduling: `mlir_scheduling.py`. Autotune: `mlir_autotune.py`.
+- **Adding a PyTorch device op:** `PyTorchSimDevice/csrc/aten/native/*` (Minimal/Extra split mirrors `torch_openreg`).
+- **TOGSim hardware model changes:** `TOGSim/src/{Core,Dram,Interconnect,L2Cache,Tile,TileGraph}.cc` + matching `include/*.h`.
+- **TOG generation:** `AsmParser/tog_generator.py` builds the raw graph and serializes it via `AsmParser/onnx_utility.py` to **ONNX, which is the on-disk TOG format** consumed by TOGSim.
+- **Eager fallback registration:** `torch.npu.register_eager_to_compile([...])`; see `tests/test_eager.py`.
+- **Per-run results:** `togsim_results/>.log` (stats) and `.trace` (instruction trace). The path is also printed at the end of every run.
+- **Wrapper codegen path:** printed as `Wrapper Codegen Path = /tmp/torchinductor_//...py`. Useful for inspecting generated kernel code and tensor names for `SRAM_BUFFER_PLAN_PATH`.
+
+## Gotchas / things I've already learned
+
+- The repo expects `python` to be a Python 3.10+ binary with `torch==2.8.0`. The frontend extends the PyTorch 2 Inductor stack, so pin to this version.
+- The default Gem5 path is hard-coded to `/workspace/gem5/build/RISCV/gem5.opt`. Override with `GEM5_PATH` if you build elsewhere.
+- `_C.cpython-311-*.so` and `torch_openreg/lib/` are build artifacts, already in `.gitignore`; don't commit them.
+- TOGSim creates a per-PID FIFO under `/tmp/togsim_fifo_` for command/event communication. Stale FIFOs left by a crashed run are cleaned up on the next start, but watch for orphaned processes if you Ctrl-C mid-run.
+- Multi-tenant runs **must** use the `with TOGSimulator(...)` context manager; otherwise the compile-time `TOGSIM_CONFIG` and the runtime config can diverge.
+- `pytorchsim_functional_mode` exists as both an **env var** and a **YAML key**; the env var is read via `extension_config.py`, and the YAML key is read inside the same module. Keep the two in agreement.
+- "No CUDA runtime is found" warnings on `import torch` are expected: this is a CPU + simulated-NPU environment, not real CUDA.
+
+## Git workflow (per CONTRIBUTING.md)
+
+- Fork → branch (`feature/`) → PR against **`develop`**, not `main`.
+- Commit prefix style observed: `[Frontend] ...`, `[TOGSim] ...`, etc.
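Returning to the multi-tenant `timestamp` convention (nanoseconds, obtained from Poisson millisecond times × 1e6), the unit conversion can be sketched in plain Python. This does not use the repo's `poisson_request_generator`, whose signature is not shown here; `poisson_arrivals_ms` and `ms_to_ns` are hypothetical helpers, and the rate, count, and seed are arbitrary illustration values.

```python
import random

NS_PER_MS = 1_000_000  # 1 ms = 1e6 ns


def poisson_arrivals_ms(rate_per_ms, count, seed=0):
    """Return `count` cumulative arrival times (ms) of a Poisson process.

    Hypothetical stand-in for a request generator: inter-arrival gaps of a
    Poisson process are exponentially distributed with mean 1 / rate.
    """
    rng = random.Random(seed)
    now_ms = 0.0
    arrivals = []
    for _ in range(count):
        now_ms += rng.expovariate(rate_per_ms)
        arrivals.append(now_ms)
    return arrivals


def ms_to_ns(t_ms):
    """Convert a millisecond time to an integer nanosecond timestamp."""
    return int(round(t_ms * NS_PER_MS))


arrivals_ms = poisson_arrivals_ms(rate_per_ms=0.5, count=5, seed=42)
timestamps_ns = [ms_to_ns(t) for t in arrivals_ms]
for t_ms, t_ns in zip(arrivals_ms, timestamps_ns):
    print(f"{t_ms:.3f} ms -> {t_ns} ns")
```

Each resulting integer is the kind of value that would be passed as `timestamp=` to `torch.npu.launch_model`, assuming the generator in use reports arrival times in milliseconds.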