
Commit 5b432e8

succinct voice tone for features

1 parent 925a482 commit 5b432e8

5 files changed (+25, -56 lines)

docs/docs/features/chat.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ description: Inference engine for chat completion, the same as OpenAI's

 The Chat Completion feature in Nitro provides a flexible way to interact with any local Large Language Model (LLM).

-## Single Request Example
+### Single Request Example

 To send a single query to your chosen LLM, follow these steps:
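
For reference, the single-request flow this page describes boils down to one HTTP call. A minimal sketch, assuming the `/inferences/llamacpp/chat_completion` route (the route itself is not shown in this diff) and a model already loaded on the default port 3928:

```bash
# Send one chat completion request to a local Nitro server.
# Assumes a model is already loaded and the server listens on port 3928.
curl http://localhost:3928/inferences/llamacpp/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ]
  }'
```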

docs/docs/features/cont-batch.md

Lines changed: 8 additions & 14 deletions
@@ -3,19 +3,17 @@ title: Continuous Batching
 description: Nitro's continuous batching combines multiple requests, enhancing throughput.
 ---

-## What is continous batching?
+Continuous batching boosts throughput and minimizes latency in large language model (LLM) inference. This technique groups multiple inference requests, significantly improving GPU utilization.

-Continuous batching is a powerful technique that significantly boosts throughput in large language model (LLM) inference while minimizing latency. This process dynamically groups multiple inference requests, allowing for more efficient GPU utilization.
+**Key Advantages:**

-## Why Continuous Batching?
+- Increased Throughput.
+- Reduced Latency.
+- Efficient GPU Use.

-Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.
+**Implementation Insight:**

-## Benefits of Continuous Batching
-
-- **Increased Throughput:** Improvement over traditional batching methods.
-- **Reduced Latency:** Lower p50 latency, leading to faster response times.
-- **Efficient Resource Utilization:** Maximizes GPU memory and computational capabilities.
+To evaluate its effectiveness, compare continuous batching with traditional methods. For more details on benchmarking, refer to this [article](https://www.anyscale.com/blog/continuous-batching-llm-inference).

 ## How to use continuous batching
 Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.
@@ -31,8 +29,4 @@ curl http://localhost:3928/inferences/llamacpp/loadmodel \
 }'
 ```

-For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
-
-### Benchmark and Compare
-
-To understand the impact of continuous batching on your system, perform benchmarks comparing it with traditional batching methods. This [article](https://www.anyscale.com/blog/continuous-batching-llm-inference) will help you quantify improvements in throughput and latency.
+For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
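
To make the retained `n_parallel`/`thread_num` advice concrete, here is a minimal sketch of a loadmodel call, assuming Nitro was launched with 4 threads; the `llama_model_path` placeholder and the `cont_batching` field are assumptions, since only the endpoint and `n_parallel` appear in this diff:

```bash
# Load a model with continuous batching, matching n_parallel to the
# thread count Nitro was launched with (4 in this example).
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/model.gguf",
    "cont_batching": true,
    "n_parallel": 4
  }'
```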

docs/docs/features/embed.md

Lines changed: 1 addition & 3 deletions
@@ -3,8 +3,6 @@ title: Embedding
 description: Inference engine for embedding, the same as OpenAI's
 ---

-## What are embeddings?
-
 Embeddings are lists of numbers (floats). To find how similar two embeddings are, we measure the [distance](https://en.wikipedia.org/wiki/Cosine_similarity) between them. Shorter distances mean they're more similar; longer distances mean less similarity.

 ## Activating Embedding Feature
@@ -44,7 +42,7 @@ curl https://api.openai.com/v1/embeddings \

 </div>

-## Embedding Reponse
+### Embedding Response

 The example response used the output from model [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) loaded to Nitro server.
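
Since the feature mirrors OpenAI's embeddings API (the `api.openai.com` call above is the comparison shown in the docs), a local equivalent might look like the sketch below; the `/v1/embeddings` route on port 3928 and the exact field names are assumptions inferred from the OpenAI example, not from this diff:

```bash
# Hypothetical local equivalent of the OpenAI embeddings call,
# assuming an embedding-enabled model is already loaded into Nitro.
curl http://localhost:3928/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Hello, world!",
    "model": "llama2-7b-chat",
    "encoding_format": "float"
  }'
```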

docs/docs/features/multi-thread.md

Lines changed: 11 additions & 28 deletions
@@ -3,34 +3,20 @@ title: Multithreading
 description: Nitro utilizes multithreading to optimize hardware usage.
 ---

-## What is Multithreading?
+Multithreading in programming allows concurrent task execution, improving efficiency and responsiveness. It's key for optimizing hardware and application performance.

-Multithreading is a programming concept where a process executes multiple threads simultaneously, improving efficiency and performance. It allows concurrent execution of tasks, such as data processing or user interface updates. This technique is crucial for optimizing hardware usage and enhancing application responsiveness.
+Effective multithreading offers:

-## Drogon's Threading Model
+- Faster Performance.
+- Responsive IO.
+- Deadlock Prevention.
+- Resource Optimization.
+- Asynchronous Programming Support.
+- Scalability Enhancement.

-Nitro powered by Drogon, a high-speed C++ web application framework, utilizes a thread pool where each thread possesses its own event loop. These event loops are central to Drogon's functionality:
+For more information on threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).

-- **Main Loop**: Runs on the main thread, responsible for starting worker loops.
-- **Worker Loops**: Handle tasks and network events, ensuring efficient task execution without blocking.
-
-## Why it's important
-
-Understanding and effectively using multithreading in Drogon is crucial for several reasons:
-
-1. **Optimized Performance**: Multithreading enhances application efficiency by enabling simultaneous task execution for faster response times.
-
-2. **Non-blocking IO Operations**: Utilizing multiple threads prevents long-running tasks from blocking the entire application, ensuring high responsiveness.
-
-3. **Deadlock Avoidance**: Event loops and threads helps prevent deadlocks, ensuring smoother and uninterrupted application operation.
-
-4. **Effective Resource Utilization**: Distributing tasks across multiple threads leads to more efficient use of server resources, improving overall performance.
-
-5. **Async Programming**
-
-6. **Scalability**
-
-## Enabling More Threads on Nitro
+## Enabling Multi-Threads on Nitro

 To increase the number of threads used by Nitro, use the following command syntax:

@@ -47,7 +33,4 @@ To launch Nitro with 4 threads, enter this command in the terminal:
 nitro 4 127.0.0.1 5000
 ```

-> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
-
-## Acknowledgements
-For more information on Drogon's threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).
+> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
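
As a usage sketch of the launch syntax kept above, tying the thread count to the machine's core count is one reasonable starting point before tuning `thread_num` (the `nproc` pairing is an illustration, not a recommendation from these docs):

```bash
# Launch Nitro with one worker thread per CPU core (Linux), then adjust
# the count while monitoring throughput and latency.
nitro "$(nproc)" 127.0.0.1 5000
```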

docs/docs/features/warmup.md

Lines changed: 4 additions & 10 deletions
@@ -3,17 +3,11 @@ title: Warming Up Model
 description: Nitro warms up the model to optimize delays.
 ---

-## What is Model Warming Up?
-
-Model warming up is the process of running pre-requests through a model to optimize its components for production use. This step is crucial for reducing initialization and optimization delays during the first few inference requests.
-
-## What are the Benefits?
-
-Warming up an AI model offers several key benefits:
-
-- **Enhanced Initial Performance:** Unlike in `llama.cpp`, where the first inference can be very slow, warming up reduces initial latency, ensuring quicker response times from the start.
-- **Consistent Response Times:** Especially beneficial for systems updating models frequently, like those with real-time training, to avoid performance lags with new snapshots.
+Model warming up involves pre-running requests through an AI model to fine-tune its components for production. This step minimizes delays during initial inferences, ensuring readiness for immediate use.

+**Key Advantages:**
+- Improved Initial Performance.
+- Stable Response Times.
 ## How to Enable Model Warming Up?

 On the Nitro server, model warming up is automatically enabled whenever a new model is loaded. This means that the server handles the warm-up process behind the scenes, ensuring that the model is ready for efficient and effective performance from the first inference request.
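
Because warm-up is automatic, no dedicated endpoint is involved; loading a model is enough to trigger it. A minimal sketch, with `llama_model_path` as a placeholder value:

```bash
# Loading a model triggers Nitro's warm-up behind the scenes, so the
# first real inference request avoids the usual cold-start latency.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/model.gguf"
  }'
```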
