This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit f8bf674

Merge branch 'dev' of github.com:janhq/nitro into chore/models-api
2 parents 38cde94 + 5338a78 commit f8bf674

21 files changed: +1663 additions, −450 deletions
docs/docs/architecture/cortexrc.mdx

Lines changed: 5 additions & 2 deletions

@@ -14,15 +14,17 @@ import TabItem from "@theme/TabItem";
 Cortex.cpp supports reading its configuration from a file called `.cortexrc`. Using this file, you can also change the data folder, Cortex.cpp API server port, and host.
 
 ## File Location
+
 The configuration file is stored in the following locations:
 
 - **Windows**: `C:\Users\<username>\.cortexrc`
 - **Linux**: `/home/<username>/.cortexrc`
 - **macOS**: `/Users/<username>/.cortexrc`
 
 ## Configuration Parameters
+
 You can configure the following parameters in the `.cortexrc` file:
-| Parameter | Description | Default Value |
+| Parameter | Description | Default Value |
 |------------------|--------------------------------------------------|--------------------------------|
 | `dataFolderPath` | Path to the folder where `.cortexrc` located. | User's home folder. |
 | `apiServerHost` | Host address for the Cortex.cpp API server. | `127.0.0.1` |
@@ -37,6 +39,7 @@ You can configure the following parameters in the `.cortexrc` file:
 | `huggingFaceToken` | HuggingFace token. | Empty string |
 
 Example of the `.cortexrc` file:
+
 ```
 logFolderPath: /Users/<username>/cortexcpp
 logLlamaCppPath: ./logs/cortex.log
@@ -49,4 +52,4 @@ apiServerPort: 39281
 checkedForUpdateAt: 1730501224
 latestRelease: v1.0.1
 huggingFaceToken: ""
-```
+```
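
As an aside for readers of this diff (not part of the commit): a minimal sketch of how a client script might pick up the documented `.cortexrc` keys to find the local API server. It assumes the plain `key: value` layout shown in the example above and that PyYAML is installed; the script itself is hypothetical.

```python
# Sketch: read the documented `.cortexrc` keys to discover where the local
# Cortex.cpp API server listens. Assumes the plain key: value layout shown in
# the example above and that PyYAML is available.
from pathlib import Path

import yaml  # pip install pyyaml

config_path = Path.home() / ".cortexrc"  # documented location on all three OSes
config = yaml.safe_load(config_path.read_text())

host = config.get("apiServerHost", "127.0.0.1")  # documented default
port = config.get("apiServerPort", 39281)        # port used in the example above
print(f"Cortex.cpp API server expected at http://{host}:{port}")
```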

docs/docs/architecture/data-folder.mdx

Lines changed: 16 additions & 76 deletions

@@ -1,6 +1,6 @@
 ---
-title: Data Folder
-description: Cortex.cpp's data folder.
+title: Data Folder and App Folder
+description: Cortex.cpp's data folder and app folder.
 slug: "data-folder"
 ---
 
@@ -17,37 +17,25 @@ When you install Cortex.cpp, three types of files will be generated on your device:
 - **Configuration Files**
 - **Data Folder**
 
-## Binary Files
+## Binary Files - under the App Folder
 These are the executable files of the Cortex.cpp application. The file format varies depending on the operating system:
 
-- **Windows**: `.exe`
-  - Stable: `C:\Users\<username>\AppData\Local\cortexcpp\cortex.exe`
-  - Beta: `C:\Users\<username>\AppData\Local\cortexcpp-beta\cortex-beta.exe`
-  - Nighty: `C:\Users\<username>\AppData\Local\cortexcpp-nightly\cortex-nightly.exe`
-- **Linux**: `.deb` or `.fedora`
-  - Stable: `/usr/bin/cortexcpp`
-  - Beta: `/usr/bin/cortexcpp-beta`
-  - Nighty: `/usr/bin/cortexcpp-nightly`
-- **macOS**: `.pkg`
-  - Stable: `/usr/local/bin/cortexcpp`
-  - Beta: `/home/<username>/.cortexrc-beta`
-  - Nighty: `/home/<username>/.cortexrc-nightly`
+- **Windows**:
+  - cli: `C:\Users\<username>\AppData\Local\cortexcpp\cortex.exe`
+  - server: `C:\Users\<username>\AppData\Local\cortexcpp\cortex-server.exe`
+- **Linux**:
+  - cli: `/usr/bin/cortex`
+  - server: `/usr/bin/cortex-server`
+- **macOS**:
+  - cli: `/usr/local/bin/cortex`
+  - server: `/usr/local/bin/cortex-server`
 
 ## Cortex.cpp Data Folder
 The data folder stores the engines, models, and logs required by Cortex.cpp. This folder is located at:
 
-- **Windows**:
-  - Stable: `C:\Users\<username>\.cortexcpp`
-  - Beta: `C:\Users\<username>\.cortexcpp-beta`
-  - Nighty: `C:\Users\<username>\.cortexcpp-nightly`
-- **Linux**:
-  - Stable: `/home/<username>/.cortexcpp<env>`
-  - Beta: `/home/<username>/.cortexcpp-beta`
-  - Nighty: `/home/<username>/.cortexcpp-nightly`
-- **macOS**:
-  - Stable: `/Users/<username>\.cortexcpp<env>`
-  - Beta: `/Users/<username>/.cortexcpp-beta`
-  - Nighty: `/Users/<username>/.cortexcpp-nightly`
+- **Windows**: `C:\Users\<username>\cortexcpp`
+- **Linux**: `/home/<username>/cortexcpp`
+- **macOS**: `/Users/<username>/cortexcpp`
 
 ### Folder Structure
 The Cortex.cpp data folder typically follows this structure:
@@ -77,57 +65,9 @@ The Cortex.cpp data folder typically follows this structure:
 └── llamacpp
 ```
 </TabItem>
-<TabItem value="Beta" label="Beta">
-```yaml
-~/.cortex-beta
-├── models/
-│ └── model.list
-│ └── huggingface.co/
-│ └── <repo_name>/
- └── <branch_name>/
- └── model.yaml
- └── model.gguf
-│ └── cortex.so/
-│ └── <repo_name>/
-│ └── <branch_name>/
- └── ...engine_files
- └── model.yaml
-│ └── imported/
- └── imported_model.yaml
-├── logs/
-│ └── cortex.txt
- └── cortex-cli.txt
-└── engines/
- └── llamacpp
-```
-</TabItem>
-<TabItem value="Nightly" label="Nightly">
-```yaml
-~/.cortex-nightly
-├── models/
-│ └── model.list
-│ └── huggingface.co/
-│ └── <repo_name>/
- └── <branch_name>/
- └── model.yaml
- └── model.gguf
-│ └── cortex.so/
-│ └── <repo_name>/
-│ └── <branch_name>/
- └── ...engine_files
- └── model.yaml
-│ └── imported/
- └── imported_model.yaml
-├── logs/
-│ └── cortex.txt
- └── cortex-cli.txt
-└── engines/
- └── llamacpp
-```
-</TabItem>
 </Tabs>
 
-#### `.cortexcpp<env>`
+#### `cortexcpp`
 The main directory that stores all Cortex-related files, located in the user's home directory.
 #### `models/`
 Contains the AI models used by Cortex for processing and generating responses.

docs/docs/capabilities/models/index.mdx

Lines changed: 10 additions & 4 deletions

@@ -7,17 +7,23 @@ description: The Model section overview
 🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
 :::
 
+Models in cortex.cpp are used for inference purposes (e.g., chat completion, embedding, etc.). We support two types of models: local and remote.
+
+Local models use a local inference engine to run completely offline on your hardware. Currently, we support llama.cpp with the GGUF model format, and we plan to support the TensorRT-LLM and ONNX engines in the future.
+
+Remote models (like OpenAI GPT-4 and Claude 3.5 Sonnet) use remote engines. Support for the OpenAI and Anthropic engines is under development and will be available in cortex.cpp soon.
+
 When Cortex.cpp is started, it automatically starts an API server, this is inspired by Docker CLI. This server manages various model endpoints. These endpoints facilitate the following:
 - **Model Operations**: Run and stop models.
 - **Model Management**: Manage your local models.
 :::info
 The model in the API server is automatically loaded/unloaded by using the [`/chat/completions`](/api-reference#tag/inference/post/v1/chat/completions) endpoint.
 :::
 ## Model Formats
-Cortex.cpp supports three model formats:
-- GGUF
-- ONNX
-- TensorRT-LLM
+Cortex.cpp supports three model formats, and each model format requires a specific engine to run:
+- GGUF - run with the `llama-cpp` engine
+- ONNX - run with the `onnxruntime` engine
+- TensorRT-LLM - run with the `tensorrt-llm` engine
 
 :::info
 For details on each format, see the [Model Formats](/docs/capabilities/models/model-yaml#model-formats) page.
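
Not part of the commit, but as a companion to the auto load/unload note above: a minimal sketch of a chat completion request against the documented endpoint. The host and port are taken from the `.cortexrc` example earlier on this page, the model id is a placeholder, and an OpenAI-compatible response shape is assumed.

```python
# Sketch of the auto load/unload flow described above: a single request to
# /v1/chat/completions is enough, since the server loads the model on demand.
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # host/port from the .cortexrc example

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "<model_id>",  # placeholder: a local GGUF model you have pulled
        "messages": [{"role": "user", "content": "Hello, what can you do?"}],
        "stream": False,
    },
    timeout=120,
)
response.raise_for_status()
# Assuming an OpenAI-compatible response body.
print(response.json()["choices"][0]["message"]["content"])
```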

docs/docs/capabilities/models/model-yaml.mdx

Lines changed: 111 additions & 17 deletions

@@ -10,7 +10,7 @@ import TabItem from "@theme/TabItem";
 🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
 :::
 
-Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.
+Cortex.cpp utilizes a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.
 
 ## Structure of `model.yaml`
 
@@ -39,6 +39,23 @@ temperature: 0.6 # Ranges: 0 to 1
 frequency_penalty: 0 # Ranges: 0 to 1
 presence_penalty: 0 # Ranges: 0 to 1
 max_tokens: 8192 # Should be default to context length
+seed: -1
+dynatemp_range: 0
+dynatemp_exponent: 1
+top_k: 40
+min_p: 0.05
+tfs_z: 1
+typ_p: 1
+repeat_last_n: 64
+repeat_penalty: 1
+mirostat: false
+mirostat_tau: 5
+mirostat_eta: 0.1
+penalize_nl: false
+ignore_eos: false
+n_probs: 0
+n_parallels: 1
+min_keep: 0
 ## END OPTIONAL
 # END INFERENCE PARAMETERS
 
@@ -54,6 +71,7 @@ prompt_template: |+ # tokenizer.chat_template
 ## BEGIN OPTIONAL
 ctx_len: 0 # llama.context_length | 0 or undefined = loaded from model
 ngl: 33 # Undefined = loaded from model
+engine: llama-cpp
 ## END OPTIONAL
 # END MODEL LOAD PARAMETERS
 
@@ -84,23 +102,59 @@ stop:
   - <|end_of_text|>
   - <|eot_id|>
   - <|eom_id|>
-stream: true
-top_p: 0.9
-temperature: 0.6
-frequency_penalty: 0
-presence_penalty: 0
-max_tokens: 8192
+stream: true # Default true?
+top_p: 0.9 # Ranges: 0 to 1
+temperature: 0.6 # Ranges: 0 to 1
+frequency_penalty: 0 # Ranges: 0 to 1
+presence_penalty: 0 # Ranges: 0 to 1
+max_tokens: 8192 # Should be default to context length
+seed: -1
+dynatemp_range: 0
+dynatemp_exponent: 1
+top_k: 40
+min_p: 0.05
+tfs_z: 1
+typ_p: 1
+repeat_last_n: 64
+repeat_penalty: 1
+mirostat: false
+mirostat_tau: 5
+mirostat_eta: 0.1
+penalize_nl: false
+ignore_eos: false
+n_probs: 0
+n_parallels: 1
+min_keep: 0
+
 ```
 Inference parameters define how the results will be produced. The required parameters include:
-| **Parameter** | **Description** | **Required** |
-|------------------------|--------------------------------------------------------------------------------------|--------------|
-| `top_p` | The cumulative probability threshold for token sampling. | No |
-| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. | No |
-| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
-| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. | No |
-| `max_tokens` | Maximum number of tokens in the output. | No |
-| `stream` | Enables or disables streaming mode for the output (true or false). | No |
-| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
+
+| **Parameter** | **Description** | **Required** |
+|---------------|-----------------|--------------|
+| `stream` | Enables or disables streaming mode for the output (true or false). | No |
+| `top_p` | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
+| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
+| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
+| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
+| `max_tokens` | Maximum number of tokens in the output for one turn. | No |
+| `seed` | Seed for the random number generator. `-1` means no fixed seed. | No |
+| `dynatemp_range` | Dynamic temperature range. | No |
+| `dynatemp_exponent` | Dynamic temperature exponent. | No |
+| `top_k` | The number of most likely tokens to consider at each step. | No |
+| `min_p` | Minimum probability threshold for token sampling. | No |
+| `tfs_z` | The z-parameter used for tail-free sampling. | No |
+| `typ_p` | The cumulative probability threshold used for typical token sampling. | No |
+| `repeat_last_n` | Number of previous tokens to penalize for repeating. | No |
+| `repeat_penalty` | Penalty for repeating tokens. | No |
+| `mirostat` | Enables or disables Mirostat sampling (true or false). | No |
+| `mirostat_tau` | Target entropy value for Mirostat sampling. | No |
+| `mirostat_eta` | Learning rate for Mirostat sampling. | No |
+| `penalize_nl` | Penalizes newline tokens (true or false). | No |
+| `ignore_eos` | Ignores the end-of-sequence token (true or false). | No |
+| `n_probs` | Number of token probabilities to return. | No |
+| `min_keep` | Minimum number of tokens to keep. | No |
+| `n_parallels` | Number of parallel streams to use, which allows multiple chat sessions at the same time. Note that `ctx_len` must be scaled with `n_parallels` (e.g., n_parallels=1, ctx_len=2048 -> n_parallels=2, ctx_len=4096). | No |
+| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
 
 
 ### Model Load Parameters
@@ -114,14 +168,54 @@ prompt_template: |+
 
 ctx_len: 0
 ngl: 33
+engine: llama-cpp
+
 ```
 Model load parameters include the options that control how Cortex.cpp runs the model. The required parameters include:
 | **Parameter** | **Description** | **Required** |
 |------------------------|--------------------------------------------------------------------------------------|--------------|
-| `ngl` | Number of attention heads. | No |
+| `ngl` | Number of model layers to offload to the GPU. | No |
 | `ctx_len` | Context length (maximum number of tokens). | No |
 | `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
+| `engine` | The engine that runs the model; defaults to `llama-cpp` for local models in GGUF format. | Yes |
+
+All parameters from the `model.yml` file are used for running the model via the [CLI chat command](/docs/cli/chat) or [CLI run command](/docs/cli/run). These parameters also act as defaults when using the [model start API](/api-reference#tag/models/post/v1/models/start) through cortex.cpp.
+
+## Runtime parameters
+
+In addition to the predefined parameters in `model.yml`, Cortex.cpp supports runtime parameters that override these settings when using the [model start API](/api-reference#tag/models/post/v1/models/start).
+
+### Model start params
+
+Cortex.cpp supports the following parameters when starting a model via the [model start API](/api-reference#tag/models/post/v1/models/start) for the `llama-cpp` engine:
+
+```
+cache_enabled: bool
+ngl: int
+n_parallel: int
+cache_type: string
+ctx_len: int
+
+## Support for vision models
+mmproj: string
+llama_model_path: string
+model_path: string
+```
+
+| **Parameter** | **Description** | **Required** |
+|------------------------|--------------------------------------------------------------------------------------|--------------|
+| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
+| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. Default is `false`. | No |
+| `mmproj` | Path to the mmproj GGUF model, which adds support for LLaVA models. | No |
+| `llama_model_path` | Path to the LLM GGUF model. | No |
+
+These parameters override the `model.yml` parameters when starting a model through the API.
+
+### Chat completion API parameters
+
+The API is accessible at the `/v1/chat/completions` URL and accepts all parameters from the chat completion API as described in the [API reference](/api-reference#tag/chat/post/v1/chat/completions).
 
+With the `llama-cpp` engine, cortex.cpp accepts all parameters from the [`model.yml` inference section](#inference-parameters) as well as all parameters from the chat completion API.
 
 :::info
 You can download all the supported model formats from the following:
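
Again not part of the commit: a minimal sketch of what the documented runtime overrides could look like in practice when calling the model start API. The field names mirror the "Model start params" list above; the exact request shape (a JSON body with a `model` field) and the model id are assumptions for illustration.

```python
# Sketch: override model.yml settings at start time via the model start API.
# Parameter names come from the "Model start params" list above; the request
# shape and model id are illustrative assumptions, not a confirmed contract.
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # host/port from the .cortexrc example

payload = {
    "model": "<model_id>",   # placeholder for a model already pulled locally
    "ctx_len": 4096,         # runtime override of the model.yml value
    "ngl": 33,               # number of layers offloaded to the GPU
    "n_parallel": 2,         # remember to scale ctx_len with the number of parallel streams
    "cache_type": "q8_0",    # KV cache type: f16 (default), q8_0, or q4_0
    "cache_enabled": True,   # reuse conversation history in subsequent requests
}

response = requests.post(f"{BASE_URL}/models/start", json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```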
