# Quantizing Llama 2 70B

This sample provides a step-by-step walkthrough of quantizing a Llama 2 70B model to 4-bit weights so that it can fit on 2xA10 GPUs.

The sample uses GPTQ in HuggingFace Transformers to reduce the weight parameters to 4 bits, so the model occupies about 35GB in memory.

## Prerequisites
* [Create an object storage bucket](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#2-object-storage) - to save the quantized model to the model catalog.
* [Set the policies](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#3-oci-policies) - to allow the OCI Data Science Service resources to access object storage buckets, networking, and other resources.
* [Notebook session](https://docs.oracle.com/en-us/iaas/data-science/using/manage-notebook-sessions.htm) - to run this sample. Use the **VM.GPU.A10.2** shape for the notebook session.
* [Access token from HuggingFace](https://huggingface.co/docs/hub/security-tokens) to download the Llama 2 model. The pre-trained model can be obtained from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or [HuggingFace](https://huggingface.co/models?sort=trending&search=meta-llama%2Fllama-2). In this example, we use the [HuggingFace access token](https://huggingface.co/docs/hub/security-tokens) to download the pre-trained model from HuggingFace (by setting the __HUGGING_FACE_HUB_TOKEN__ environment variable).
* Log in to HuggingFace with the auth token (see the sketch after this list for a programmatic alternative):
  * Open a terminal window in the notebook session.
  * Run `huggingface-cli login`.
  * Paste the auth token when prompted.
  * See more information [here](https://huggingface.co/docs/huggingface_hub/quick-start#login).
* Install the required Python libraries (from a terminal window):

```bash
pip install "transformers[sentencepiece]==4.32.1" "optimum==1.12.0" "auto-gptq==0.4.2" "accelerate==0.22.0" "safetensors>=0.3.1" --upgrade
```
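
Alternatively, the access token can be supplied programmatically instead of through the CLI. A minimal sketch using the `huggingface_hub` login helper; the placeholder token value is an assumption you must replace with your own:

```python
import os

from huggingface_hub import login

# Replace the placeholder with your own HuggingFace access token,
# or export HUGGING_FACE_HUB_TOKEN before starting the notebook.
hf_token = os.environ.get("HUGGING_FACE_HUB_TOKEN", "<your-hf-access-token>")

# Authenticates this session so gated models such as Llama 2 can be downloaded.
login(token=hf_token)
```
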
## Load the full model
We can load the full model using the `device_map="auto"` argument. Weights that cannot fit into the GPUs are kept in CPU memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

# Load the full (unquantized) model in float16. device_map="auto" places as many
# layers as fit on the GPUs and keeps the rest on the CPU.
model_full = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

By looking at the device map, we can see that many layers are loaded into CPU memory.
```python
model_full.hf_device_map
```

<details>
<summary>Full 70B model device map on A10.2</summary>

```python
{'model.embed_tokens': 0,
'model.layers.0': 1,
'model.layers.1': 1,
'model.layers.2': 1,
'model.layers.3': 1,
'model.layers.4': 1,
'model.layers.5': 'cpu',
'model.layers.6': 'cpu',
'model.layers.7': 'cpu',
'model.layers.8': 'cpu',
'model.layers.9': 'cpu',
'model.layers.10': 'cpu',
'model.layers.11': 'cpu',
'model.layers.12': 'cpu',
'model.layers.13': 'cpu',
'model.layers.14': 'cpu',
'model.layers.15': 'cpu',
'model.layers.16': 'cpu',
'model.layers.17': 'cpu',
'model.layers.18': 'cpu',
'model.layers.19': 'cpu',
'model.layers.20': 'cpu',
'model.layers.21': 'cpu',
'model.layers.22': 'cpu',
'model.layers.23': 'cpu',
'model.layers.24': 'cpu',
'model.layers.25': 'cpu',
'model.layers.26': 'cpu',
'model.layers.27': 'cpu',
'model.layers.28': 'cpu',
'model.layers.29': 'cpu',
'model.layers.30': 'cpu',
'model.layers.31': 'cpu',
'model.layers.32': 'cpu',
'model.layers.33': 'cpu',
'model.layers.34': 'cpu',
'model.layers.35': 'cpu',
'model.layers.36': 'cpu',
'model.layers.37': 'cpu',
'model.layers.38': 'cpu',
'model.layers.39': 'cpu',
'model.layers.40': 'cpu',
'model.layers.41': 'cpu',
'model.layers.42': 'cpu',
'model.layers.43': 'cpu',
'model.layers.44': 'cpu',
'model.layers.45': 'cpu',
'model.layers.46': 'cpu',
'model.layers.47': 'cpu',
'model.layers.48': 'cpu',
'model.layers.49': 'cpu',
'model.layers.50': 'cpu',
'model.layers.51': 'cpu',
'model.layers.52': 'cpu',
'model.layers.53': 'cpu',
'model.layers.54': 'cpu',
'model.layers.55': 'cpu',
'model.layers.56': 'cpu',
'model.layers.57': 'cpu',
'model.layers.58': 'cpu',
'model.layers.59': 'cpu',
'model.layers.60': 'cpu',
'model.layers.61': 'cpu',
'model.layers.62': 'cpu',
'model.layers.63': 'cpu',
'model.layers.64': 'cpu',
'model.layers.65': 'cpu',
'model.layers.66': 'cpu',
'model.layers.67': 'cpu',
'model.layers.68': 'cpu',
'model.layers.69': 'cpu',
'model.layers.70': 'cpu',
'model.layers.71': 'cpu',
'model.layers.72': 'cpu',
'model.layers.73': 'cpu',
'model.layers.74': 'cpu',
'model.layers.75': 'cpu',
'model.layers.76': 'cpu',
'model.layers.77': 'cpu',
'model.layers.78': 'cpu',
'model.layers.79': 'cpu',
'model.norm': 'cpu',
'lm_head': 'cpu'}
```

</details>

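If you want to confirm how the memory is being used before quantizing, a quick check using standard `transformers` and PyTorch utilities (not part of the original sample):

```python
# Total size of the loaded weights (in GB) and the per-GPU allocation.
print(f"Model footprint: {model_full.get_memory_footprint() / 1e9:.1f} GB")
for gpu_id in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(gpu_id) / 1e9
    print(f"GPU {gpu_id}: {allocated:.1f} GB allocated")
```
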
## Quantize the model
It is possible to quantize the model on the A10.2 shape. However, the maximum sequence length is limited by the available GPU RAM.
Quantization requires a dataset to calibrate the quantized model. In this example we use the `wikitext2` dataset.

We need to cap the GPU memory used for loading the model, since we have to keep some memory free for the quantization itself. Here we specify a `max_memory` of 5GB for each GPU when loading the model.
Due to the size of the model and the limited memory on the A10, we also need to limit the maximum sequence length the model can take during quantization (`model_seqlen`) to 128. You may increase this number by reducing the `max_memory` used by each GPU when loading the model.
We also need to set `max_split_size_mb` for PyTorch to reduce memory fragmentation.

The following parameters have been found to work when quantizing on A10.2:
* `max_split_size_mb` = 512
* `max_memory` = 5GB (for each GPU)
* `model_seqlen` = 128

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Reduce CUDA memory fragmentation during quantization.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

model_id = "meta-llama/Llama-2-70b-hf"
dataset_id = "wikitext2"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
gptq_config = GPTQConfig(bits=4, dataset=dataset_id, tokenizer=tokenizer, model_seqlen=128)

# Loading with the GPTQ config quantizes the model layer by layer.
# Keep each GPU at 5GB so there is headroom for the calibration pass.
model_quantized = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "5GB", 1: "5GB", "cpu": "400GB"},
    quantization_config=gptq_config,
)
```

The process will show the progress over the 80 layers. It takes about 1 hour with the `wikitext2` dataset.

Note that we cannot run inference on this particular "quantized model", as some "blocks" are loaded across multiple devices. For inference, we need to save the model and load it back.

Save the quantized model and the tokenizer:
```python
save_folder = "Llama-2-70b-hf-quantized"
model_quantized.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
```

Since the model was partially offloaded, `disable_exllama` was set to `True` in the saved configuration to avoid an error. For inference and production loads we want to leverage the exllama kernels, therefore we need to change the `config.json`:
Edit the `config.json` file in the save folder, find the key `disable_exllama` and set it to `false`.

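The same edit can also be scripted. A minimal sketch, assuming the flag is stored under the `quantization_config` section of the saved `config.json`:

```python
import json

config_path = f"{save_folder}/config.json"

with open(config_path) as f:
    config = json.load(f)

# Re-enable the exllama kernels for inference.
config["quantization_config"]["disable_exllama"] = False

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```
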
## Working with the quantized model
Now that the model is saved to disk, we can see that its size is about 34.1GB. That aligns with our calculations (roughly 70B parameters at 4 bits per weight is about 35GB), and it can fit on 2xA10 GPUs, which together provide 48GB of GPU memory.

We can load the quantized model back without the `max_memory` limit:
```python
tokenizer = AutoTokenizer.from_pretrained(save_folder)

# The saved 4-bit weights now fit entirely on the two GPUs.
model_quantized = AutoModelForCausalLM.from_pretrained(
    save_folder,
    device_map="auto",
)
```

By checking the device map, we see that the entire model is loaded onto the GPUs:
<details>
<summary>Quantized model device map on A10.2</summary>

```python
{'model.embed_tokens': 0,
'model.layers.0': 0,
'model.layers.1': 0,
'model.layers.2': 0,
'model.layers.3': 0,
'model.layers.4': 0,
'model.layers.5': 0,
'model.layers.6': 0,
'model.layers.7': 0,
'model.layers.8': 0,
'model.layers.9': 0,
'model.layers.10': 0,
'model.layers.11': 0,
'model.layers.12': 0,
'model.layers.13': 0,
'model.layers.14': 0,
'model.layers.15': 0,
'model.layers.16': 0,
'model.layers.17': 0,
'model.layers.18': 0,
'model.layers.19': 0,
'model.layers.20': 0,
'model.layers.21': 0,
'model.layers.22': 0,
'model.layers.23': 0,
'model.layers.24': 0,
'model.layers.25': 0,
'model.layers.26': 0,
'model.layers.27': 0,
'model.layers.28': 0,
'model.layers.29': 0,
'model.layers.30': 0,
'model.layers.31': 0,
'model.layers.32': 0,
'model.layers.33': 0,
'model.layers.34': 0,
'model.layers.35': 0,
'model.layers.36': 0,
'model.layers.37': 0,
'model.layers.38': 1,
'model.layers.39': 1,
'model.layers.40': 1,
'model.layers.41': 1,
'model.layers.42': 1,
'model.layers.43': 1,
'model.layers.44': 1,
'model.layers.45': 1,
'model.layers.46': 1,
'model.layers.47': 1,
'model.layers.48': 1,
'model.layers.49': 1,
'model.layers.50': 1,
'model.layers.51': 1,
'model.layers.52': 1,
'model.layers.53': 1,
'model.layers.54': 1,
'model.layers.55': 1,
'model.layers.56': 1,
'model.layers.57': 1,
'model.layers.58': 1,
'model.layers.59': 1,
'model.layers.60': 1,
'model.layers.61': 1,
'model.layers.62': 1,
'model.layers.63': 1,
'model.layers.64': 1,
'model.layers.65': 1,
'model.layers.66': 1,
'model.layers.67': 1,
'model.layers.68': 1,
'model.layers.69': 1,
'model.layers.70': 1,
'model.layers.71': 1,
'model.layers.72': 1,
'model.layers.73': 1,
'model.layers.74': 1,
'model.layers.75': 1,
'model.layers.76': 1,
'model.layers.77': 1,
'model.layers.78': 1,
'model.layers.79': 1,
'model.norm': 1,
'lm_head': 1}
```

</details>

## Testing the model
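
A minimal sketch of generating text with the quantized model loaded above; the prompt and generation settings are illustrative assumptions, not part of the original walkthrough:

```python
import torch

# Illustrative prompt; any text works here.
prompt = "Large language model quantization is useful because"

# The embedding layer sits on GPU 0 (see the device map above), so place inputs there.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

with torch.no_grad():
    output_ids = model_quantized.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```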
