## Install AWQ

Set up the environment (please refer to the [llm-awq](https://github.com/mit-han-lab/llm-awq) repository for more details):

```bash
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq

# cd /path/to/FastChat
pip install --upgrade pip    # enable PEP 660 support
pip install -e .             # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .             # install awq package

cd awq/kernels
python setup.py install      # install awq CUDA kernels
```
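
Before downloading any weights, a quick sanity check can confirm that a GPU is visible and that both the Python package and the compiled kernels were installed. This is a minimal sketch; the extension name `awq_inference_engine` is what the llm-awq kernel build produces at the time of writing, so treat it as an assumption if your version differs.

```python
# sanity_check.py - verify the AWQ installation before downloading weights
import importlib.util

import torch

# FastChat's AWQ path requires a CUDA-capable GPU.
print("CUDA available :", torch.cuda.is_available())

# Python-side package installed by `pip install -e .` in llm-awq.
print("awq package    :", importlib.util.find_spec("awq") is not None)

# Compiled CUDA kernels; the module name may differ across llm-awq versions.
print("awq kernels    :", importlib.util.find_spec("awq_inference_engine") is not None)
```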

## Chat with the CLI

```bash
# Download the quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq

# You can specify which quantized model to use by setting --awq-ckpt
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
```
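
The CLI flags above map onto FastChat's Python loader, so the quantized model can also be loaded programmatically. The sketch below assumes the `load_model` and `AWQConfig` interfaces exposed by `fastchat.model` and `fastchat.modules.awq` in recent FastChat releases; check the import paths and keyword names against your installed version before relying on it.

```python
# Hedged sketch: load the AWQ checkpoint from Python instead of the CLI.
# Assumes fastchat.model.load_model accepts an awq_config keyword and that
# fastchat.modules.awq.AWQConfig exists in your FastChat version.
from fastchat.model import load_model
from fastchat.modules.awq import AWQConfig

model, tokenizer = load_model(
    "models/vicuna-7b-v1.3-4bit-g128-awq",
    device="cuda",
    awq_config=AWQConfig(
        ckpt="models/vicuna-7b-v1.3-4bit-g128-awq",  # or point at a specific checkpoint, as --awq-ckpt does
        wbits=4,
        groupsize=128,
    ),
)

# The returned objects behave like an ordinary Hugging Face model/tokenizer pair.
prompt = "What is AWQ quantization?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```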

## Benchmark

- With 4-bit weight quantization, AWQ fits larger language models within device memory limits and significantly accelerates token generation (a rough memory estimate illustrating this appears after the tables below). All benchmarks are run with group_size 128.

- Benchmark on NVIDIA RTX A6000:

  | Model           | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
  | --------------- | ---- | ---------------- | ---------------- | ----------- |
  | vicuna-7b       | 16   | 13543            | 26.06            | /           |
  | vicuna-7b       | 4    | 5547             | 12.43            | 2.1x        |
  | llama2-7b-chat  | 16   | 13543            | 27.14            | /           |
  | llama2-7b-chat  | 4    | 5547             | 12.44            | 2.2x        |
  | vicuna-13b      | 16   | 25647            | 44.91            | /           |
  | vicuna-13b      | 4    | 9355             | 17.30            | 2.6x        |
  | llama2-13b-chat | 16   | 25647            | 47.28            | /           |
  | llama2-13b-chat | 4    | 9355             | 20.28            | 2.3x        |

- Benchmark on NVIDIA RTX 4090:

  | Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
  | --------------- | ------------------------- | --------------------- | ----------- |
  | vicuna-7b       | 8.61                      | 19.09                 | 2.2x        |
  | llama2-7b-chat  | 8.66                      | 19.97                 | 2.3x        |
  | vicuna-13b      | 12.17                     | OOM                   | /           |
  | llama2-13b-chat | 13.54                     | OOM                   | /           |

- Benchmark on NVIDIA Jetson Orin:

  | Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
  | --------------- | ------------------------- | --------------------- | ----------- |
  | vicuna-7b       | 65.34                     | 93.12                 | 1.4x        |
  | llama2-7b-chat  | 75.11                     | 104.71                | 1.4x        |
  | vicuna-13b      | 115.40                    | OOM                   | /           |
  | llama2-13b-chat | 136.81                    | OOM                   | /           |
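
As a rough illustration of the memory claim above, the following back-of-the-envelope sketch compares weight-only storage for a 13B-parameter model at FP16 versus 4-bit with `group_size` 128. It ignores activations, the KV cache, and framework overhead, and the per-group overhead term is an assumption about a typical AWQ packing scheme, so the numbers are indicative only.

```python
# Back-of-the-envelope weight memory for a 13B-parameter model.
# Illustrative only: ignores activations, KV cache, and runtime overhead.
params = 13e9
group_size = 128

fp16_bytes = params * 2  # 16 bits = 2 bytes per weight

# 4-bit weights plus, per group of 128 weights, one FP16 scale and one
# 4-bit zero point (an assumed packing scheme; exact overhead is kernel-specific).
awq_bytes = params * 0.5 + (params / group_size) * (2 + 0.5)

gib = 1024 ** 3
print(f"FP16 weights: {fp16_bytes / gib:.1f} GiB")  # ~24 GiB
print(f"AWQ 4-bit   : {awq_bytes / gib:.1f} GiB")   # ~6 GiB
```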