This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit e72eec2

Update model.yml documentation
1 parent bd38e71

2 files changed (+41 −2 lines changed)

docs/docs/capabilities/models/index.mdx

Lines changed: 2 additions & 0 deletions
@@ -8,7 +8,9 @@ description: The Model section overview
:::

Models in cortex.cpp are used for inference purposes (e.g., chat completion, embedding, etc.). We support two types of models: local and remote.
+
Local models use a local inference engine to run completely offline on your hardware. Currently, we support llama.cpp with the GGUF model format, and we have plans to support TensorRT-LLM and ONNX engines in the future.
+
Remote models (like OpenAI GPT-4 and Claude 3.5 Sonnet) use remote engines. Support for OpenAI and Anthropic engines is under development and will be available in cortex.cpp soon.

When Cortex.cpp is started, it automatically starts an API server (a design inspired by the Docker CLI). This server manages various model endpoints. These endpoints facilitate the following:

docs/docs/capabilities/models/model-yaml.mdx

Lines changed: 39 additions & 2 deletions
@@ -10,7 +10,7 @@ import TabItem from "@theme/TabItem";
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
:::

- Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.
+ Cortex.cpp utilizes a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.

## Structure of `model.yaml`

@@ -174,11 +174,48 @@ engine: llama-cpp
Model load parameters include the options that control how Cortex.cpp runs the model. The required parameters include:
| **Parameter** | **Description** | **Required** |
|------------------------|--------------------------------------------------------------------------------------|--------------|
- | `ngl` | Number of attention heads. | No |
+ | `ngl` | Number of model layers to offload to the GPU. | No |
| `ctx_len` | Context length (maximum number of tokens). | No |
| `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
| `engine` | The engine that runs the model; defaults to `llama-cpp` for local models in GGUF format. | Yes |
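For orientation, a minimal `model.yml` fragment covering just these load parameters might look like the sketch below; the values and the prompt template are illustrative placeholders, not defaults:

```yaml
# Illustrative load parameters; adjust to your model and hardware.
ngl: 32                 # number of model layers to offload to the GPU
ctx_len: 4096           # maximum context length in tokens
prompt_template: "{system_message} USER: {prompt} ASSISTANT:"   # placeholder template
engine: llama-cpp       # default engine for local GGUF models
```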

All parameters from the `model.yml` file are used for running the model via the [CLI chat command](/docs/cli/chat) or [CLI run command](/docs/cli/run). These parameters also act as defaults when using the [model start API](/api-reference#tag/models/post/v1/models/start) through cortex.cpp.
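For instance, a downloaded model could be launched with its `model.yml` defaults from the terminal (the model ID below is hypothetical):

```sh
# Start an interactive session using the parameters defined in the model's model.yml
cortex run my-model
```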
## Runtime parameters

In addition to the predefined parameters in `model.yml`, Cortex.cpp supports runtime parameters that override these settings when using the [model start API](/api-reference#tag/models/post/v1/models/start).

### Model start params

Cortex.cpp supports the following parameters when starting a model via the [model start API](/api-reference#tag/models/post/v1/models/start) for the `llama-cpp` engine:

```
cache_enabled: bool
ngl: int
n_parallel: int
cache_type: string
ctx_len: int

# Support for vision models
mmproj: string
llama_model_path: string
model_path: string
```
| **Parameter** | **Description** | **Required** |
|------------------------|--------------------------------------------------------------------------------------|--------------|
| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. The default is `false`. | No |

These parameters override the `model.yml` parameters when starting a model through the API.
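As a sketch, a model start request with a few overrides could look like the following; the host, port, and model ID are placeholders for your local cortex.cpp server setup:

```sh
# Start a model and override ngl, ctx_len, and KV-cache settings from model.yml
curl http://127.0.0.1:39281/v1/models/start \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "ngl": 32,
    "ctx_len": 8192,
    "cache_type": "q8_0",
    "cache_enabled": true
  }'
```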
### Chat completion API parameters
The API is accessible at the `/v1/chat/completions` URL and accepts all parameters from the chat completion API, as described in the [API reference](/api-reference#tag/chat/post/v1/chat/completions).

With the `llama-cpp` engine, cortex.cpp accepts all parameters from the [`model.yml` inference section](#inference-parameters) as well as all parameters from the chat completion API.
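For example (hypothetical model ID and local server address), per-request inference parameters can be sent alongside the messages:

```sh
# Chat completion request; temperature and max_tokens are standard
# chat completion parameters passed at request time
curl http://127.0.0.1:39281/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```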
:::info
You can download all the supported model formats from the following:
- [Cortex Model Repos](/docs/hub/cortex-hub)
