feat: llm compressor AI-23582 #79
**Open** — EdisonSu768 wants to merge 4 commits into `master` from `feat/llm-compressor`
```diff
@@ -4,3 +4,4 @@ knative
 kserve
 xinference
 servicemeshv1
+ipynb
```
@@ -0,0 +1,48 @@

---
weight: 30
---

# LLM Compressor with Alauda AI

This document describes how to use the LLM Compressor integration with the Alauda AI platform to perform model compression workflows. The Alauda AI integration of LLM Compressor provides two example workflows:

- A workbench image and the <a href="/data-free-compressor.ipynb" target="_blank">data-free compressor notebook</a>, which demonstrate how to compress a model without a dataset.
- A workbench image and the <a href="/calibration-compressor.ipynb" target="_blank">calibration compressor notebook</a>, which demonstrate how to compress a model using a calibration dataset.

## Supported Model Compression Workflows

On the Alauda AI platform, you can use the Workbench feature to run LLM Compressor on models stored in your model repository. The following workflow outlines the typical steps for compressing a model.

### Create a Workbench

Follow the instructions in [Create Workbench](../../workbench/how_to/create_workbench.mdx) to create a new Workbench instance. Note that model compression is currently supported only within **JupyterLab**.

### Create a Model Repository and Upload Models

Refer to [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx) for detailed steps on creating a model repository and uploading your model files. The example notebooks in this guide use the [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) model.
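If you need to fetch the example model into the workbench first, the following is a minimal sketch, assuming the workbench has internet access and the `huggingface_hub` package installed (the local path matches the `model_id` used by the example notebooks):

```python
# A minimal sketch, assuming huggingface_hub is installed and the workbench
# can reach huggingface.co; local_dir matches the notebooks' model_id path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    local_dir="./TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
print(local_dir)
```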
### (Optional) Prepare and Upload a Dataset

:::note
If you plan to use the **data-free compressor notebook**, you can skip this step.
:::

To use the **calibration compressor notebook**, you must prepare and upload a calibration dataset. Prepare your dataset using the same process described in *Upload Models Using Notebook*. The example calibration notebook uses the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
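The dataset can be fetched the same way, again as a hedged sketch assuming `huggingface_hub` and internet access (the local path matches the `dataset_id` used in the calibration notebook):

```python
# A minimal sketch; repo_type="dataset" is required for dataset repositories.
from huggingface_hub import snapshot_download

snapshot_download(
    "HuggingFaceH4/ultrachat_200k",
    repo_type="dataset",
    local_dir="./ultrachat_200k",
)
```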
### Clone Models and Datasets in JupyterLab

In the JupyterLab terminal, use `git clone` to download the model repository (and the dataset, if applicable) to your workspace, as sketched below. The data-free compressor notebook does not require a dataset.
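A hypothetical sketch of this step follows; the repository URLs are placeholders for the clone addresses shown in your model repository. The leading `!` runs each command from a notebook cell; omit it in the terminal.

```python
# Hypothetical clone addresses -- substitute the URLs from your own model
# repository. The target paths match those used by the example notebooks.
!git clone https://<model-repo-host>/<namespace>/TinyLlama-1.1B-Chat-v1.0.git ./TinyLlama/TinyLlama-1.1B-Chat-v1.0
!git clone https://<model-repo-host>/<namespace>/ultrachat_200k.git ./ultrachat_200k  # calibration workflow only
```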
### Create and Run Compression Notebooks

Download the appropriate example notebook for your use case: the **calibration compressor notebook** if you are using a dataset, or the **data-free compressor notebook** otherwise. Create a new notebook (for example, `compressor.ipynb`) in JupyterLab and paste the contents of the example notebook into it. Run the cells to perform model compression.

### Upload the Compressed Model to the Repository

Once compression is complete, upload the compressed model back to the model repository using the steps outlined in *Upload Models Using Notebook*.
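As one hypothetical sketch of that upload (paths and the commit message are placeholders; your repository layout may differ), the compressed output directory produced by the notebooks can be committed and pushed from the JupyterLab terminal:

```python
# A hypothetical sketch: copy the compressed output into the cloned model
# repository, then commit and push. Omit the leading "!" in the terminal.
!cp -r ./TinyLlama-1.1B-Chat-v1.0-GPTQ-W4A16 <cloned-model-repo>/
!cd <cloned-model-repo> && git add -A && git commit -m "Add GPTQ W4A16 compressed model" && git push
```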
### Deploy and Use the Compressed Model for Inference

Quantized and sparse models that you create with LLM Compressor are saved using the `compressed-tensors` library (an extension of [Safetensors](https://huggingface.co/docs/safetensors/en/index)). The compression format matches the model's quantization or sparsity type. These formats are natively supported in vLLM, so the Alauda AI Inference Server can serve the compressed model through optimized inference kernels.
Follow the instructions in [create inference service](../../model_inference/inference_service/functions/inference_service.mdx#create-inference-service) to complete this step.
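Before creating the inference service, you can optionally smoke-test the compressed output directly. The following is a minimal sketch, assuming vLLM is installed in the workbench and the data-free notebook's output directory exists:

```python
# A minimal sketch, assuming vLLM is installed; vLLM detects the
# compressed-tensors format from the model directory automatically.
from vllm import LLM, SamplingParams

llm = LLM(model="./TinyLlama-1.1B-Chat-v1.0-W4A16")
outputs = llm.generate(
    ["What is model quantization?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```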
@@ -0,0 +1,7 @@

---
weight: 60
---

# How To

<Overview />
@@ -0,0 +1,7 @@

---
weight: 82
---

# LLM Compressor

<Overview />
@@ -0,0 +1,35 @@

---
weight: 10
---

# Introduction

## Preface

[LLM Compressor](https://github.com/vllm-project/llm-compressor), part of [the vLLM project](https://docs.vllm.ai/en/latest/) for efficient serving of LLMs, integrates the latest model compression research into a single open-source library, enabling the generation of efficient, compressed models with minimal effort.

The framework allows users to apply some of the most recent research on model compression techniques to improve generative AI (gen AI) models' efficiency, scalability, and performance while maintaining accuracy. With native support for Hugging Face and vLLM, the compressed models can be integrated into deployment pipelines, delivering faster and more cost-effective inference at scale.

LLM Compressor allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to improve inference speed without affecting the accuracy of model responses. The following compression methodologies are supported by LLM Compressor:

- **Quantization**: Converts model weights and activations to lower-bit formats such as int8, reducing memory usage.
- **Sparsity**: Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
- **Compression**: Shrinks the saved model file size, ideally with minimal impact on performance.

Use these methods together to deploy models more efficiently on resource-limited hardware.
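As a rough illustration of the savings (back-of-the-envelope arithmetic, not a benchmark): the weights of a 1.1B-parameter model stored at 16 bits occupy about 1.1B × 2 bytes ≈ 2.2 GB, while the same weights quantized to 4 bits occupy about 1.1B × 0.5 bytes ≈ 0.55 GB, roughly a 4× reduction before accounting for quantization metadata such as scales and zero-points.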
## Compression Techniques

LLM Compressor supports a wide variety of compression techniques (a recipe sketch follows this list):

- Weight-only quantization (W4A16) compresses model weights to 4-bit precision, valuable for AI applications with limited hardware resources or high sensitivity to latency.
- Weight and activation quantization (W8A8) compresses both weights and activations to 8-bit precision, targeting general server scenarios for integer and floating-point formats.
- Weight pruning, also known as sparsification, removes certain weights from the model entirely. While this requires fine-tuning, it can be used in conjunction with quantization for further inference acceleration.
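The following is a minimal sketch of how these schemes are expressed as recipes. It reuses the `QuantizationModifier` API from the example notebooks; the W8A8 line is an assumed variant based on the scheme name, not taken from the notebooks:

```python
# Recipe sketches. The W4A16 recipe mirrors the example notebooks; the W8A8
# recipe is an assumed variant of the same API, shown for comparison.
from llmcompressor.modifiers.quantization import QuantizationModifier

w4a16_recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
w8a8_recipe = QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])
```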
## Compression Algorithms

LLM Compressor supports several compression algorithms:

- AWQ: Weight-only `INT4` quantization
- GPTQ: Weight-only `INT4` quantization
- FP8: Dynamic per-token quantization
- SparseGPT: Post-training sparsity
- SmoothQuant: Activation quantization

Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per tensor, channel, group, or token. The final result is a compressed model saved with all its applied quantization parameters.
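To make "scales and zero-points" concrete, here is a minimal, self-contained sketch of per-tensor affine quantization. It is illustrative only and is not LLM Compressor's implementation:

```python
# Illustrative per-tensor affine quantization -- not LLM Compressor's
# internals. The scale maps the float range onto the integer range; the
# zero-point shifts it so the minimum float value lands on qmin.
import torch

def quantize_per_tensor(w: torch.Tensor, num_bits: int = 8):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = torch.clamp(torch.round(qmin - w.min() / scale), qmin, qmax)
    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax)
    return q.to(torch.int8), scale, zero_point

w = torch.randn(64, 64)
q, scale, zp = quantize_per_tensor(w)
w_hat = (q.float() - zp) * scale  # dequantize to inspect the rounding error
print(f"max abs error: {(w - w_hat).abs().max():.5f}")
```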
@@ -0,0 +1,164 @@

```json
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LLM Compressor Workbench -- Getting Started\n",
    "\n",
    "This notebook demonstrates how common [LLM Compressor](https://github.com/vllm-project/llm-compressor) flows can be run on Alauda AI.\n",
    "\n",
    "We will show how a user can compress a Large Language Model with a calibration dataset.\n",
    "\n",
    "The notebook detects whether a GPU is available. If one is not available, it demonstrates an abbreviated run, so users without GPU access can still get a feel for `llm-compressor`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calibrated Compression with a Dataset\n",
    "\n",
    "Some more advanced compression algorithms require a small dataset of calibration samples that are meant to be a representative random subset of the language the model will see at inference.\n",
    "\n",
    "We will show how the data-free flow can be augmented with a calibration dataset and GPTQ, one of the first published LLM compression algorithms.\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>Note:</b> This will take several minutes if no GPU is available.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "use_gpu = torch.cuda.is_available()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will use a recipe running GPTQ (https://arxiv.org/abs/2210.17323)\n",
    "# to reduce the error caused by quantization. GPTQ requires a calibration dataset.\n",
    "from llmcompressor.modifiers.quantization import GPTQModifier\n",
    "\n",
    "# Model to compress\n",
    "model_id = \"./TinyLlama/TinyLlama-1.1B-Chat-v1.0\"\n",
    "recipe = GPTQModifier(targets=\"Linear\", scheme=\"W4A16\", ignore=[\"lm_head\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the model and tokenizer using the Hugging Face API\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_id, device_map=\"auto\", torch_dtype=\"auto\"\n",
    ")\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "# Create the calibration dataset using the Hugging Face datasets API\n",
    "dataset_id = \"./ultrachat_200k\"\n",
    "\n",
    "# Select the number of samples. 512 samples is a good place to start.\n",
    "# Increasing the number of samples can improve accuracy.\n",
    "num_calibration_samples = 512 if use_gpu else 4\n",
    "max_sequence_length = 2048 if use_gpu else 16\n",
    "\n",
    "# Load the dataset\n",
    "ds = load_dataset(dataset_id, split=\"train_sft\")\n",
    "# Shuffle and keep only the number of samples we need\n",
    "ds = ds.shuffle(seed=42).select(range(num_calibration_samples))\n",
    "\n",
    "\n",
    "# Preprocess and tokenize into the format the model expects\n",
    "def preprocess(example):\n",
    "    text = tokenizer.apply_chat_template(\n",
    "        example[\"messages\"],\n",
    "        tokenize=False,\n",
    "    )\n",
    "    return tokenizer(\n",
    "        text,\n",
    "        padding=False,\n",
    "        max_length=max_sequence_length,\n",
    "        truncation=True,\n",
    "        add_special_tokens=False,\n",
    "    )\n",
    "\n",
    "\n",
    "ds = ds.map(preprocess, remove_columns=ds.column_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run oneshot compression with the calibration dataset\n",
    "from llmcompressor import oneshot\n",
    "\n",
    "model = oneshot(\n",
    "    model=model,\n",
    "    dataset=ds,\n",
    "    recipe=recipe,\n",
    "    max_seq_length=max_sequence_length,\n",
    "    num_calibration_samples=num_calibration_samples,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the compressed model and tokenizer\n",
    "model_dir = \"./\" + model_id.split(\"/\")[-1] + \"-GPTQ-W4A16\"\n",
    "model.save_pretrained(model_dir)\n",
    "tokenizer.save_pretrained(model_dir);"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
```
@@ -0,0 +1,99 @@

```json
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LLM Compressor Workbench -- Getting Started\n",
    "\n",
    "This notebook demonstrates how common [LLM Compressor](https://github.com/vllm-project/llm-compressor) flows can be run on Alauda AI.\n",
    "\n",
    "We will show how a user can compress a Large Language Model without calibration data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data-Free Model Compression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llmcompressor.modifiers.quantization import QuantizationModifier\n",
    "\n",
    "# Model to compress\n",
    "model_id = \"./TinyLlama/TinyLlama-1.1B-Chat-v1.0\"\n",
    "\n",
    "# This recipe will quantize all Linear layers except those in the `lm_head`,\n",
    "# which is often sensitive to quantization. The W4A16 scheme compresses\n",
    "# weights to 4-bit integers while retaining 16-bit activations.\n",
    "recipe = QuantizationModifier(targets=\"Linear\", scheme=\"W4A16\", ignore=[\"lm_head\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the model and tokenizer using the Hugging Face API\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_id, device_map=\"auto\", torch_dtype=\"auto\"\n",
    ")\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run data-free compression using `oneshot`\n",
    "from llmcompressor import oneshot\n",
    "\n",
    "model = oneshot(model=model, recipe=recipe, tokenizer=tokenizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the compressed model and tokenizer\n",
    "model_dir = \"./\" + model_id.split(\"/\")[-1] + \"-W4A16\"\n",
    "model.save_pretrained(model_dir)\n",
    "tokenizer.save_pretrained(model_dir);"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
```