Releases: Fenix46/llama.cpp

llama-server paged scheduler v0.1.1

03 May 15:05

macOS arm64 release for the paged llama-server fork.

Included in this package:

  • fixes for paged KV cache reset/release
  • corrected sparse block-table handling
  • corrected block reuse in the paged allocator
  • paged scheduler fixes for request cancel/release and mixed-request batching
  • Metal paged attention alignment for the runtime --kv-block-size value
  • Metal paged attention parallelism tuning
  • updated Metal and CUDA launch commands in the bundled README
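Several of the items above concern block-table and allocator bookkeeping. As an illustrative sketch only (class and method names are assumptions, not this fork's code), a paged KV allocator with a free list and per-sequence block tables works roughly like this:

```python
class PagedKVAllocator:
    """Illustrative paged-KV block allocator (not the fork's actual code).

    Physical blocks are recycled through a free list; each sequence owns a
    block table mapping logical block index -> physical block id.
    """

    def __init__(self, num_blocks: int, block_size: int = 64):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # LIFO free list -> reuse
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Ensure a physical block backs token position `pos`; return its id."""
        table = self.block_tables.setdefault(seq_id, [])
        logical = pos // self.block_size
        while len(table) <= logical:
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted")
            table.append(self.free_blocks.pop())  # reuse most recently freed
        return table[logical]

    def release(self, seq_id: int) -> None:
        """Return a finished/cancelled sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

The cancel/release fixes in this release are about exactly this kind of path: a cancelled request must hand every block back so later requests can reuse them.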

Recommended Apple Silicon launch command:

./bin/llama-server \
  -m /path/to/model.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -ngl 99 \
  --scheduler paged \
  --flash-attn on \
  -c 128000 \
  --max-model-len 128000 \
  -b 2048 \
  -ub 512 \
  --kv-block-size 64 \
  --gpu-memory-utilization 0.90 \
  --kv-prefix-cache \
  --cache-ram 0 \
  --webui
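With the flags above, the 128,000-token context is carved into fixed-size KV blocks; at --kv-block-size 64, a full-length sequence occupies 2,000 blocks. A quick sanity check (plain arithmetic, independent of the fork's code):

```python
ctx_len = 128_000   # matches -c / --max-model-len above
block_size = 64     # matches --kv-block-size above

# A full-length sequence occupies ceil(ctx_len / block_size) blocks.
blocks_needed = -(-ctx_len // block_size)  # ceiling division
print(blocks_needed)  # 2000
```

Shrinking the block size trades finer-grained reuse for a larger block table; this is the value the Metal paged attention kernels are aligned against in this release.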

llama-server paged scheduler v0.1.0

30 Apr 18:50


Initial macOS arm64 package for the private llama.cpp paged scheduler fork.