Continuous batching boosts throughput and minimizes latency in large language model (LLM) inference. This technique groups multiple inference requests, significantly improving GPU utilization.

**Key Advantages:**

- Increased Throughput: a clear improvement over traditional batching methods.
- Reduced Latency: lower p50 latency and faster response times.
- Efficient GPU Use: maximizes GPU memory and computational capabilities.

Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.

**Implementation Insight:**

To evaluate its effectiveness, compare continuous batching with traditional methods. For more details on benchmarking, refer to this [article](https://www.anyscale.com/blog/continuous-batching-llm-inference).
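
A minimal way to run such a comparison is to time a burst of concurrent requests. The sketch below assumes a model is already loaded and that Nitro serves its OpenAI-compatible chat route on the default port 3928 (both are assumptions; adjust to your setup):

```bash
# Hedged sketch: time N concurrent chat-completion requests against a
# locally running Nitro server (assumed endpoint and port).
N=8
time (
  for i in $(seq 1 "$N"); do
    curl -s http://localhost:3928/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}]}' > /dev/null &
  done
  wait
)
```

Running this once with continuous batching enabled and once without gives a rough before/after picture of throughput under concurrency.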

## How to use continuous batching
Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.
For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
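
As a concrete sketch, continuous batching is typically switched on when the model is loaded. The endpoint path and parameter names below reflect common Nitro usage but should be treated as assumptions and verified against your Nitro version:

```bash
# Hedged sketch: load a model with continuous batching enabled.
# "cont_batching" turns the feature on; "n_parallel" should match the
# thread_num the server was started with (4 in this example).
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100,
    "n_parallel": 4,
    "cont_batching": true
  }'
```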

**docs/docs/features/embed.md**

---
title: Embedding
description: Inference engine for embedding, the same as OpenAI's
---

Embeddings are lists of numbers (floats). To find how similar two embeddings are, we measure the [distance](https://en.wikipedia.org/wiki/Cosine_similarity) between them. Shorter distances mean they're more similar; longer distances mean less similarity.
The example response used the output from model [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) loaded to Nitro server.
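
The request that produced it is not shown here; as a hedged sketch, a call to an OpenAI-compatible embeddings route on Nitro's assumed default port might look like this (endpoint path, port, and field names are assumptions):

```bash
# Hedged sketch: request an embedding for a short input string.
curl http://localhost:3928/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Hello",
    "model": "your-loaded-model",
    "encoding_format": "float"
  }'
```

The response contains a list of floats, which can then be compared via cosine distance as described above.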

**docs/docs/features/multi-thread.md**

---
title: Multithreading
description: Nitro utilizes multithreading to optimize hardware usage.
---

Multithreading in programming allows concurrent task execution, improving efficiency and responsiveness. It's key for optimizing hardware and application performance.

Effective multithreading offers:

- Faster Performance: simultaneous task execution speeds up response times.
- Responsive IO: long-running tasks no longer block the whole application.
- Deadlock Prevention: event loops and dedicated threads help avoid deadlocks.
- Resource Optimization: distributing tasks across threads makes more efficient use of server resources.
- Asynchronous Programming Support.
- Scalability Enhancement.

Nitro is powered by Drogon, a high-speed C++ web application framework that utilizes a thread pool in which each thread has its own event loop. These event loops are central to Drogon's functionality: the main loop runs on the main thread and starts the worker loops, which handle tasks and network events without blocking.

For more information on threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).

## Enabling Multithreading on Nitro
To increase the number of threads used by Nitro, use the following command syntax:
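
The exact syntax block is elided in this diff; inferred from the example that follows, the general form appears to be (an assumption, not confirmed here):

```bash
# thread_num: number of worker threads; host and port: where Nitro listens.
nitro [thread_num] [host] [port]
```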

To launch Nitro with 4 threads, enter this command in the terminal:

```bash
nitro 4 127.0.0.1 5000
```

> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.

**docs/docs/features/warmup.md**

---
title: Warming Up Model
description: Nitro warms up the model to optimize delays.
---

Model warming up involves pre-running requests through an AI model to fine-tune its components for production. This step minimizes delays during initial inferences, ensuring readiness for immediate use.

**Key Advantages:**

- Improved Initial Performance: reduces the very slow first inference seen in `llama.cpp`, ensuring quicker response times from the start.
- Stable Response Times: especially beneficial for systems that update models frequently, such as those with real-time training, avoiding performance lags with new snapshots.

## How to Enable Model Warming Up?
On the Nitro server, model warming up is automatically enabled whenever a new model is loaded. This means that the server handles the warm-up process behind the scenes, ensuring that the model is ready for efficient and effective performance from the first inference request.
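
There is therefore nothing extra to configure. As a hedged sketch (endpoint path and parameter names assumed from common Nitro usage, not confirmed by this page), the model-load call below is what triggers the warm-up pass:

```bash
# Hedged sketch: loading a model; the server runs its warm-up
# requests automatically once the load completes.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100
  }'
```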