Releases: Fenix46/llama.cpp
llama-server paged scheduler v0.1.1
macOS arm64 release for the paged llama-server fork.
Included in this package:
- updated paged KV cache reset/release fixes
- corrected sparse block-table handling
- corrected paged allocator block reuse
- paged scheduler fixes for cancel/release and mixed-request batching
- Metal paged attention alignment for runtime --kv-block-size
- Metal paged attention parallelism tuning
- updated launch commands for Metal and CUDA in the bundled README
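As a rough sanity check on the --kv-block-size setting, the sketch below computes how many KV blocks one full-length sequence would span. It assumes each paged block holds --kv-block-size tokens, which is the usual paged-attention convention but is not spelled out in these notes. The values are the ones used in this release's recommended launch command (-c 128000, --kv-block-size 64); the variable names are illustrative only.

```shell
#!/bin/sh
# Illustrative arithmetic only; assumes one block stores block_size tokens.
ctx=128000        # value passed to -c / --max-model-len
block_size=64     # value passed to --kv-block-size

# Ceiling division: number of blocks needed to cover the whole context.
blocks=$(( (ctx + block_size - 1) / block_size ))
echo "blocks per full-length sequence: $blocks"
```

Changing block_size here shows how the block count scales if a different --kv-block-size is tried; whether the fork accepts other values is not stated in these notes.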
Recommended Apple Silicon launch command:
./bin/llama-server \
-m /path/to/model.gguf \
--host 127.0.0.1 \
--port 8080 \
-ngl 99 \
--scheduler paged \
--flash-attn on \
-c 128000 \
--max-model-len 128000 \
-b 2048 \
-ub 512 \
--kv-block-size 64 \
--gpu-memory-utilization 0.90 \
--kv-prefix-cache \
--cache-ram 0 \
--webui

llama-server paged scheduler v0.1.0
Initial macOS arm64 package for the private llama.cpp paged scheduler fork.