Using FastChat
AWQ 4bit Inference
We integrated AWQ into FastChat to provide efficient and accurate 4bit LLM inference.
Install AWQ
Set up the environment (please refer to this link for more details):
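The original setup commands are not reproduced in this copy; the block below is a minimal sketch of one plausible sequence. It assumes a fresh conda environment and that the upstream mit-han-lab/llm-awq package (including its CUDA kernels) is built from source alongside FastChat; if the linked instructions differ, follow those instead.

```bash
# Create an isolated environment (the Python version here is an assumption; adjust as needed).
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq

# Install FastChat from source.
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install --upgrade pip
pip install -e .

# Install the AWQ runtime and build its CUDA kernels
# (paths and steps may differ from the upstream instructions; defer to the link above if so).
git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .
cd awq/kernels
python setup.py install
```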
Chat with the CLI
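A sketch of a CLI session is shown below. It assumes a pre-quantized 4-bit checkpoint published under mit-han-lab on Hugging Face and that your FastChat build exposes the `--awq-wbits` and `--awq-groupsize` flags; check `python3 -m fastchat.serve.cli --help` if in doubt.

```bash
# Download a quantized checkpoint from Hugging Face (requires git-lfs).
git lfs install
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq

# Chat with the 4-bit model; the group size must match the checkpoint (g128 here).
python3 -m fastchat.serve.cli \
    --model-path vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
```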
Benchmark
- Through 4-bit weight quantization, AWQ helps run larger language models within the device memory budget and significantly accelerates token generation (a rough way to reproduce such measurements is sketched after the tables below). All benchmarks are done with group_size 128.
- Benchmark on NVIDIA RTX A6000:

  | Model           | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
  | --------------- | ---- | ---------------- | ---------------- | ----------- |
  | vicuna-7b       | 16   | 13543            | 26.06            | /           |
  | vicuna-7b       | 4    | 5547             | 12.43            | 2.1x        |
  | llama2-7b-chat  | 16   | 13543            | 27.14            | /           |
  | llama2-7b-chat  | 4    | 5547             | 12.44            | 2.2x        |
  | vicuna-13b      | 16   | 25647            | 44.91            | /           |
  | vicuna-13b      | 4    | 9355             | 17.30            | 2.6x        |
  | llama2-13b-chat | 16   | 25647            | 47.28            | /           |
  | llama2-13b-chat | 4    | 9355             | 20.28            | 2.3x        |

- NVIDIA RTX 4090:

  | Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
  | --------------- | ------------------------- | --------------------- | ----------- |
  | vicuna-7b       | 8.61                      | 19.09                 | 2.2x        |
  | llama2-7b-chat  | 8.66                      | 19.97                 | 2.3x        |
  | vicuna-13b      | 12.17                     | OOM                   | /           |
  | llama2-13b-chat | 13.54                     | OOM                   | /           |

- NVIDIA Jetson Orin:

  | Model           | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
  | --------------- | ------------------------- | --------------------- | ----------- |
  | vicuna-7b       | 65.34                     | 93.12                 | 1.4x        |
  | llama2-7b-chat  | 75.11                     | 104.71                | 1.4x        |
  | vicuna-13b      | 115.40                    | OOM                   | /           |
  | llama2-13b-chat | 136.81                    | OOM                   | /           |
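To reproduce numbers like these on your own hardware, one rough approach is sketched below. It assumes the CLI's `--debug` flag still reports per-reply generation speed in your FastChat version, and uses `nvidia-smi` to observe memory use.

```bash
# In a separate terminal, poll GPU memory once per second while chatting:
#   nvidia-smi --query-gpu=memory.used --format=csv -l 1

# --debug prints extra per-reply information (including generation speed in
# token/s in recent FastChat versions); ms/token is then 1000 / (token/s).
python3 -m fastchat.serve.cli \
    --model-path vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128 \
    --debug
```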