This repository was archived by the owner on Jul 4, 2025. It is now read-only.

Commit f8bf674

Merge branch 'dev' of github.com:janhq/nitro into chore/models-api
2 parents 38cde94 + 5338a78 commit f8bf674

21 files changed: +1663 additions, −450 deletions
docs/docs/architecture/cortexrc.mdx

Lines changed: 5 additions & 2 deletions

@@ -14,15 +14,17 @@ import TabItem from "@theme/TabItem";
 Cortex.cpp supports reading its configuration from a file called `.cortexrc`. Using this file, you can also change the data folder, Cortex.cpp API server port, and host.
 
 ## File Location
+
 The configuration file is stored in the following locations:
 
 - **Windows**: `C:\Users\<username>\.cortexrc`
 - **Linux**: `/home/<username>/.cortexrc`
 - **macOS**: `/Users/<username>/.cortexrc`
 
 ## Configuration Parameters
+
 You can configure the following parameters in the `.cortexrc` file:
-| Parameter | Description | Default Value |
+| Parameter | Description | Default Value |
 |------------------|--------------------------------------------------|--------------------------------|
 | `dataFolderPath` | Path to the folder where `.cortexrc` located. | User's home folder. |
 | `apiServerHost` | Host address for the Cortex.cpp API server. | `127.0.0.1` |
@@ -37,6 +39,7 @@ You can configure the following parameters in the `.cortexrc` file:
 | `huggingFaceToken` | HuggingFace token. | Empty string |
 
 Example of the `.cortexrc` file:
+
 ```
 logFolderPath: /Users/<username>/cortexcpp
 logLlamaCppPath: ./logs/cortex.log
@@ -49,4 +52,4 @@ apiServerPort: 39281
 checkedForUpdateAt: 1730501224
 latestRelease: v1.0.1
 huggingFaceToken: ""
-```
+```
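
As an aside for readers of this diff (not part of the commit): a minimal sketch of how a client script might pick up the documented `.cortexrc` keys to find the local API server. It assumes the plain `key: value` layout shown in the example above and that PyYAML is installed; the script itself is hypothetical.

```python
# Sketch: read the documented `.cortexrc` keys to discover where the local
# Cortex.cpp API server listens. Assumes the plain key: value layout shown in
# the example above and that PyYAML is available.
from pathlib import Path

import yaml  # pip install pyyaml

config_path = Path.home() / ".cortexrc"  # documented location on all three OSes
config = yaml.safe_load(config_path.read_text())

host = config.get("apiServerHost", "127.0.0.1")  # documented default
port = config.get("apiServerPort", 39281)        # port used in the example above
print(f"Cortex.cpp API server expected at http://{host}:{port}")
```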

docs/docs/architecture/data-folder.mdx

Lines changed: 16 additions & 76 deletions

@@ -1,6 +1,6 @@
 ---
-title: Data Folder
-description: Cortex.cpp's data folder.
+title: Data Folder and App Folder
+description: Cortex.cpp's data folder and app folder.
 slug: "data-folder"
 ---
 
@@ -17,37 +17,25 @@ When you install Cortex.cpp, three types of files will be generated on your device:
 - **Configuration Files**
 - **Data Folder**
 
-## Binary Files
+## Binary Files - under the App Folder
 These are the executable files of the Cortex.cpp application. The file format varies depending on the operating system:
 
-- **Windows**: `.exe`
-  - Stable: `C:\Users\<username>\AppData\Local\cortexcpp\cortex.exe`
-  - Beta: `C:\Users\<username>\AppData\Local\cortexcpp-beta\cortex-beta.exe`
-  - Nighty: `C:\Users\<username>\AppData\Local\cortexcpp-nightly\cortex-nightly.exe`
-- **Linux**: `.deb` or `.fedora`
-  - Stable: `/usr/bin/cortexcpp`
-  - Beta: `/usr/bin/cortexcpp-beta`
-  - Nighty: `/usr/bin/cortexcpp-nightly`
-- **macOS**: `.pkg`
-  - Stable: `/usr/local/bin/cortexcpp`
-  - Beta: `/home/<username>/.cortexrc-beta`
-  - Nighty: `/home/<username>/.cortexrc-nightly`
+- **Windows**:
+  - cli: `C:\Users\<username>\AppData\Local\cortexcpp\cortex.exe`
+  - server: `C:\Users\<username>\AppData\Local\cortexcpp\cortex-server.exe`
+- **Linux**:
+  - cli: `/usr/bin/cortex`
+  - server: `/usr/bin/cortex-server`
+- **macOS**:
+  - cli: `/usr/local/bin/cortex`
+  - server: `/usr/local/bin/cortex-server`
 
 ## Cortex.cpp Data Folder
 The data folder stores the engines, models, and logs required by Cortex.cpp. This folder is located at:
 
-- **Windows**:
-  - Stable: `C:\Users\<username>\.cortexcpp`
-  - Beta: `C:\Users\<username>\.cortexcpp-beta`
-  - Nighty: `C:\Users\<username>\.cortexcpp-nightly`
-- **Linux**:
-  - Stable: `/home/<username>/.cortexcpp<env>`
-  - Beta: `/home/<username>/.cortexcpp-beta`
-  - Nighty: `/home/<username>/.cortexcpp-nightly`
-- **macOS**:
-  - Stable: `/Users/<username>\.cortexcpp<env>`
-  - Beta: `/Users/<username>/.cortexcpp-beta`
-  - Nighty: `/Users/<username>/.cortexcpp-nightly`
+- **Windows**: `C:\Users\<username>\cortexcpp`
+- **Linux**: `/home/<username>/cortexcpp`
+- **macOS**: `/Users/<username>/cortexcpp`
 
 ### Folder Structure
 The Cortex.cpp data folder typically follows this structure:
@@ -77,57 +65,9 @@ The Cortex.cpp data folder typically follows this structure:
 └── llamacpp
 ```
 </TabItem>
-<TabItem value="Beta" label="Beta">
-```yaml
-~/.cortex-beta
-├── models/
-│ └── model.list
-│ └── huggingface.co/
-│ └── <repo_name>/
- └── <branch_name>/
- └── model.yaml
- └── model.gguf
-│ └── cortex.so/
-│ └── <repo_name>/
-│ └── <branch_name>/
- └── ...engine_files
- └── model.yaml
-│ └── imported/
- └── imported_model.yaml
-├── logs/
-│ └── cortex.txt
- └── cortex-cli.txt
-└── engines/
- └── llamacpp
-```
-</TabItem>
-<TabItem value="Nightly" label="Nightly">
-```yaml
-~/.cortex-nightly
-├── models/
-│ └── model.list
-│ └── huggingface.co/
-│ └── <repo_name>/
- └── <branch_name>/
- └── model.yaml
- └── model.gguf
-│ └── cortex.so/
-│ └── <repo_name>/
-│ └── <branch_name>/
- └── ...engine_files
- └── model.yaml
-│ └── imported/
- └── imported_model.yaml
-├── logs/
-│ └── cortex.txt
- └── cortex-cli.txt
-└── engines/
- └── llamacpp
-```
-</TabItem>
 </Tabs>
 
-#### `.cortexcpp<env>`
+#### `cortexcpp`
 The main directory that stores all Cortex-related files, located in the user's home directory.
 #### `models/`
 Contains the AI models used by Cortex for processing and generating responses.

docs/docs/capabilities/models/index.mdx

Lines changed: 10 additions & 4 deletions

@@ -7,17 +7,23 @@ description: The Model section overview
 🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
 :::
 
+Models in cortex.cpp are used for inference purposes (e.g., chat completion, embedding, etc.). We support two types of models: local and remote.
+
+Local models use a local inference engine to run completely offline on your hardware. Currently, we support llama.cpp with the GGUF model format, and we plan to support the TensorRT-LLM and ONNX engines in the future.
+
+Remote models (like OpenAI GPT-4 and Claude 3.5 Sonnet) use remote engines. Support for the OpenAI and Anthropic engines is under development and will be available in cortex.cpp soon.
+
 When Cortex.cpp is started, it automatically starts an API server, this is inspired by Docker CLI. This server manages various model endpoints. These endpoints facilitate the following:
 - **Model Operations**: Run and stop models.
 - **Model Management**: Manage your local models.
 :::info
 The model in the API server is automatically loaded/unloaded by using the [`/chat/completions`](/api-reference#tag/inference/post/v1/chat/completions) endpoint.
 :::
 ## Model Formats
-Cortex.cpp supports three model formats:
-- GGUF
-- ONNX
-- TensorRT-LLM
+Cortex.cpp supports three model formats, and each model format requires a specific engine to run:
+- GGUF - run with the `llama-cpp` engine
+- ONNX - run with the `onnxruntime` engine
+- TensorRT-LLM - run with the `tensorrt-llm` engine
 
 :::info
 For details on each format, see the [Model Formats](/docs/capabilities/models/model-yaml#model-formats) page.
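
Not part of the commit, but as a companion to the auto load/unload note above: a minimal sketch of a chat completion request against the documented endpoint. The host and port are taken from the `.cortexrc` example earlier on this page, the model id is a placeholder, and an OpenAI-compatible response shape is assumed.

```python
# Sketch of the auto load/unload flow described above: a single request to
# /v1/chat/completions is enough, since the server loads the model on demand.
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # host/port from the .cortexrc example

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "<model_id>",  # placeholder: a local GGUF model you have pulled
        "messages": [{"role": "user", "content": "Hello, what can you do?"}],
        "stream": False,
    },
    timeout=120,
)
response.raise_for_status()
# Assuming an OpenAI-compatible response body.
print(response.json()["choices"][0]["message"]["content"])
```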

docs/docs/capabilities/models/model-yaml.mdx

Lines changed: 111 additions & 17 deletions

@@ -10,7 +10,7 @@ import TabItem from "@theme/TabItem";
 🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
 :::
 
-Cortex.cpp uses a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.
+Cortex.cpp utilizes a `model.yaml` file to specify the configuration for running a model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the `models` folder.
 
 ## Structure of `model.yaml`
 
@@ -39,6 +39,23 @@ temperature: 0.6 # Ranges: 0 to 1
 frequency_penalty: 0 # Ranges: 0 to 1
 presence_penalty: 0 # Ranges: 0 to 1
 max_tokens: 8192 # Should be default to context length
+seed: -1
+dynatemp_range: 0
+dynatemp_exponent: 1
+top_k: 40
+min_p: 0.05
+tfs_z: 1
+typ_p: 1
+repeat_last_n: 64
+repeat_penalty: 1
+mirostat: false
+mirostat_tau: 5
+mirostat_eta: 0.1
+penalize_nl: false
+ignore_eos: false
+n_probs: 0
+n_parallels: 1
+min_keep: 0
 ## END OPTIONAL
 # END INFERENCE PARAMETERS
 
@@ -54,6 +71,7 @@ prompt_template: |+ # tokenizer.chat_template
 ## BEGIN OPTIONAL
 ctx_len: 0 # llama.context_length | 0 or undefined = loaded from model
 ngl: 33 # Undefined = loaded from model
+engine: llama-cpp
 ## END OPTIONAL
 # END MODEL LOAD PARAMETERS
 
@@ -84,23 +102,59 @@ stop:
   - <|end_of_text|>
   - <|eot_id|>
   - <|eom_id|>
-stream: true
-top_p: 0.9
-temperature: 0.6
-frequency_penalty: 0
-presence_penalty: 0
-max_tokens: 8192
+stream: true # Default true?
+top_p: 0.9 # Ranges: 0 to 1
+temperature: 0.6 # Ranges: 0 to 1
+frequency_penalty: 0 # Ranges: 0 to 1
+presence_penalty: 0 # Ranges: 0 to 1
+max_tokens: 8192 # Should be default to context length
+seed: -1
+dynatemp_range: 0
+dynatemp_exponent: 1
+top_k: 40
+min_p: 0.05
+tfs_z: 1
+typ_p: 1
+repeat_last_n: 64
+repeat_penalty: 1
+mirostat: false
+mirostat_tau: 5
+mirostat_eta: 0.1
+penalize_nl: false
+ignore_eos: false
+n_probs: 0
+n_parallels: 1
+min_keep: 0
+
 ```
 Inference parameters define how the results will be produced. The required parameters include:
-| **Parameter** | **Description** | **Required** |
-|------------------------|--------------------------------------------------------------------------------------|--------------|
-| `top_p` | The cumulative probability threshold for token sampling. | No |
-| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. | No |
-| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. | No |
-| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. | No |
-| `max_tokens` | Maximum number of tokens in the output. | No |
-| `stream` | Enables or disables streaming mode for the output (true or false). | No |
-| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
+
+| **Parameter** | **Description** | **Required** |
+|---------------|-----------------|--------------|
+| `stream` | Enables or disables streaming mode for the output (true or false). | No |
+| `top_p` | The cumulative probability threshold for token sampling. Ranges from 0 to 1. | No |
+| `temperature` | Controls the randomness of predictions by scaling logits before applying softmax. Ranges from 0 to 1. | No |
+| `frequency_penalty` | Penalizes new tokens based on their existing frequency in the sequence so far. Ranges from 0 to 1. | No |
+| `presence_penalty` | Penalizes new tokens based on whether they appear in the sequence so far. Ranges from 0 to 1. | No |
+| `max_tokens` | Maximum number of tokens in the output for one turn. | No |
+| `seed` | Seed for the random number generator. `-1` means no fixed seed. | No |
+| `dynatemp_range` | Dynamic temperature range. | No |
+| `dynatemp_exponent` | Dynamic temperature exponent. | No |
+| `top_k` | The number of most likely tokens to consider at each step. | No |
+| `min_p` | Minimum probability threshold for token sampling. | No |
+| `tfs_z` | The z-parameter used for tail-free sampling. | No |
+| `typ_p` | The cumulative probability threshold used for typical token sampling. | No |
+| `repeat_last_n` | Number of previous tokens to penalize for repeating. | No |
+| `repeat_penalty` | Penalty for repeating tokens. | No |
+| `mirostat` | Enables or disables Mirostat sampling (true or false). | No |
+| `mirostat_tau` | Target entropy value for Mirostat sampling. | No |
+| `mirostat_eta` | Learning rate for Mirostat sampling. | No |
+| `penalize_nl` | Penalizes newline tokens (true or false). | No |
+| `ignore_eos` | Ignores the end-of-sequence token (true or false). | No |
+| `n_probs` | Number of token probabilities to return. | No |
+| `min_keep` | Minimum number of tokens to keep. | No |
+| `n_parallels` | Number of parallel streams to use, which allows multiple chat sessions at the same time. Note that `ctx_len` must be scaled with `n_parallels` (e.g., n_parallels=1, ctx_len=2048 -> n_parallels=2, ctx_len=4096). | No |
+| `stop` | Specifies the stopping condition for the model, which can be a word, a letter, or a specific text. | Yes |
 
 
 ### Model Load Parameters
@@ -114,14 +168,54 @@ prompt_template: |+
 
 ctx_len: 0
 ngl: 33
+engine: llama-cpp
+
 ```
 Model load parameters include the options that control how Cortex.cpp runs the model. The required parameters include:
 | **Parameter** | **Description** | **Required** |
 |------------------------|--------------------------------------------------------------------------------------|--------------|
-| `ngl` | Number of attention heads. | No |
+| `ngl` | Number of model layers to offload to the GPU. | No |
 | `ctx_len` | Context length (maximum number of tokens). | No |
 | `prompt_template` | Template for formatting the prompt, including system messages and instructions. | Yes |
+| `engine` | The engine that runs the model; defaults to `llama-cpp` for local models in GGUF format. | Yes |
+
+All parameters from the `model.yml` file are used for running the model via the [CLI chat command](/docs/cli/chat) or [CLI run command](/docs/cli/run). These parameters also act as defaults when using the [model start API](/api-reference#tag/models/post/v1/models/start) through cortex.cpp.
+
+## Runtime parameters
+
+In addition to the predefined parameters in `model.yml`, Cortex.cpp supports runtime parameters that override these settings when using the [model start API](/api-reference#tag/models/post/v1/models/start).
+
+### Model start params
+
+Cortex.cpp supports the following parameters when starting a model via the [model start API](/api-reference#tag/models/post/v1/models/start) for the `llama-cpp` engine:
+
+```
+cache_enabled: bool
+ngl: int
+n_parallel: int
+cache_type: string
+ctx_len: int
+
+## Support for vision models
+mmproj: string
+llama_model_path: string
+model_path: string
+```
+
+| **Parameter** | **Description** | **Required** |
+|------------------------|--------------------------------------------------------------------------------------|--------------|
+| `cache_type` | Data type of the KV cache in llama.cpp models. Supported types are `f16`, `q8_0`, and `q4_0`; the default is `f16`. | No |
+| `cache_enabled` | Enables caching of conversation history for reuse in subsequent requests. Default is `false`. | No |
+| `mmproj` | Path to the mmproj GGUF model, which adds support for LLaVA models. | No |
+| `llama_model_path` | Path to the LLM GGUF model. | No |
+
+These parameters override the `model.yml` parameters when starting a model through the API.
+
+### Chat completion API parameters
+
+The API is accessible at the `/v1/chat/completions` URL and accepts all parameters from the chat completion API as described in the [API reference](/api-reference#tag/chat/post/v1/chat/completions).
 
+With the `llama-cpp` engine, cortex.cpp accepts all parameters from the [`model.yml` inference section](#inference-parameters) as well as all parameters from the chat completion API.
 
 :::info
 You can download all the supported model formats from the following:
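
Again not part of the commit: a minimal sketch of what the documented runtime overrides could look like in practice when calling the model start API. The field names mirror the "Model start params" list above; the exact request shape (a JSON body with a `model` field) and the model id are assumptions for illustration.

```python
# Sketch: override model.yml settings at start time via the model start API.
# Parameter names come from the "Model start params" list above; the request
# shape and model id are illustrative assumptions, not a confirmed contract.
import requests

BASE_URL = "http://127.0.0.1:39281/v1"  # host/port from the .cortexrc example

payload = {
    "model": "<model_id>",   # placeholder for a model already pulled locally
    "ctx_len": 4096,         # runtime override of the model.yml value
    "ngl": 33,               # number of layers offloaded to the GPU
    "n_parallel": 2,         # remember to scale ctx_len with the number of parallel streams
    "cache_type": "q8_0",    # KV cache type: f16 (default), q8_0, or q4_0
    "cache_enabled": True,   # reuse conversation history in subsequent requests
}

response = requests.post(f"{BASE_URL}/models/start", json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```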
