no longer works on 96GB RTX Pro 6000 Blackwell with default options (used to) #431

When I first started using ds4 at d0357ec, it would load and use 93128MiB of my 97887MiB.

```
ds4: context buffers 751.71 MiB (ctx=32768, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=8194)
```

As of 8d45bc4, it used 95420MiB...

```
ds4: context buffers 1051.74 MiB (ctx=32768, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=8194)
```

...but you can bring it back down to the smaller size with DS4_METAL_PREFILL_CHUNK=2048 (if nothing else, that gives us a more stable baseline).
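To put numbers on the difference between those two log lines, here is a small Python sketch that parses them and compares the fields. It only uses the values quoted above; the per-row figure assumes the context buffers grow linearly with raw_kv_rows, which is my guess and not something I have checked in the ds4 source.

```python
import re

# The two "context buffers" lines quoted above, copied verbatim.
LOG_D0357EC = ("ds4: context buffers 751.71 MiB (ctx=32768, backend=cuda, "
               "prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=8194)")
LOG_8D45BC4 = ("ds4: context buffers 1051.74 MiB (ctx=32768, backend=cuda, "
               "prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=8194)")


def parse_context_line(line):
    """Return (buffer size in MiB, dict of the integer key=value fields)."""
    size = float(re.search(r"context buffers ([\d.]+) MiB", line).group(1))
    # Non-numeric fields such as backend=cuda are skipped by the \d+ pattern.
    fields = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}
    return size, fields


old_size, old_fields = parse_context_line(LOG_D0357EC)
new_size, new_fields = parse_context_line(LOG_8D45BC4)

delta_mib = new_size - old_size
extra_rows = new_fields["raw_kv_rows"] - old_fields["raw_kv_rows"]

print(f"context buffers grew by {delta_mib:.2f} MiB")
print(f"raw_kv_rows grew by {extra_rows} rows")
# Assumption: linear scaling of buffer size with raw_kv_rows.
print(f"roughly {delta_mib * 1024 / extra_rows:.0f} KiB per extra raw KV row")
for name, fields in (("d0357ec", old_fields), ("8d45bc4", new_fields)):
    print(f"{name}: raw_kv_rows - prefill_chunk = "
          f"{fields['raw_kv_rows'] - fields['prefill_chunk']}")
```

In both lines raw_kv_rows is prefill_chunk + 256, and the context buffers differ by about 300 MiB. Total usage went up by 2292MiB (95420 - 93128) between the two commits, so the context buffers alone do not seem to account for all of the growth.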

As of 15f42aa, it seemed to load fragments of the model on demand and ran so slowly (something like 40x slower) as to be totally unusable. These were dark times ;P.

And, somewhere around here (the merge paths got complex and it sometimes doesn't compile), instead of being very slow, it just failed with an out-of-memory error.

```
ds4: CUDA loading model tensors into device cache: 22.03 GiB
ds4: CUDA q8 fp16 cache budget exhausted; using q8 kernels (request=16.00 MiB cached=10.29 GiB free=3.12 GiB reserve=4.75 GiB total=94.97 GiB)
processing 10 input tokens: 10/10 (100.0%)
ds4: CUDA model range alloc failed for token_embd (1010.00 MiB): out of memory
ds4: decode failed: cuda decode failed
```

But, as of 1704eca, it preloads everything and works again, even running at full speed! That said, this is with the prefill chunk override, and it still takes up 95284MiB... but that does work with the default context size.

The issue now, though, is that the extra memory it needs as of that commit means that, without DS4_METAL_PREFILL_CHUNK=2048, it no longer fits, at least with the default context size.

```
$ ./ds4
ds4: CUDA backend initialized on NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (sm_120)
ds4: CUDA host registration skipped: operation not supported
ds4: CUDA preparing model tensor mappings: 80.24 GiB
ds4: CUDA q8 fp16 cache budget exhausted; using q8 kernels (request=64.00 MiB cached=5.55 GiB limit=12.00 GiB free=4.81 GiB reserve=4.75 GiB total=94.97 GiB)
ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 28.495s
ds4: cuda backend initialized for graph diagnostics
ds4: context buffers 1053.75 MiB (ctx=32768, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=8194)
ds4: CUDA tensor alloc failed: out of memory
ds4: interactive chat KV cache requires a session backend
```

This still seems to be the status as of the most recent commit, e34a808, though somewhere along the line (I haven't bothered to bisect this) it started using slightly more memory still: 95412MiB, even with DS4_METAL_PREFILL_CHUNK=2048; the max working context is a bit over 131072.

Regardless, the memory usage of the defaults seems to have slowly bloated to the point where just running ./ds4 doesn't work anymore. Without DS4_METAL_PREFILL_CHUNK=2048, I have to shrink the context all the way to <=8187 (8192 does not actually fit ;P).
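To make the trend easier to see, here is a short Python sketch that tabulates the figures reported above against the 97887MiB total. The prefill chunk column is taken from the logs and from where I mention the override; I am assuming the default context size for every row, which I did not state explicitly for e34a808.

```python
# Total device memory as reported above, in MiB.
TOTAL_MIB = 97887

# (commit, prefill_chunk in effect, reported usage in MiB), in the order
# the commits are discussed above.
REPORTED = [
    ("d0357ec", 2048, 93128),
    ("8d45bc4", 4096, 95420),
    ("1704eca", 2048, 95284),
    ("e34a808", 2048, 95412),
]


def headroom(used_mib, total_mib=TOTAL_MIB):
    """Memory left on the card after the reported usage, in MiB."""
    return total_mib - used_mib


baseline = REPORTED[0][2]
for commit, chunk, used in REPORTED:
    print(f"{commit}  chunk={chunk}  used={used} MiB  "
          f"free={headroom(used)} MiB  growth={used - baseline:+d} MiB")
```

With the 2048 override, free memory went from 4759MiB at d0357ec to 2475MiB at e34a808, so usage at the same chunk size grew by 2284MiB.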
