We integrated AWQ into FastChat to provide efficient and accurate 4-bit LLM inference.

Model | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
---|---|---|---|---|
vicuna-7b | 16 | 13543 | 26.06 | / |
vicuna-7b | 4 | 5547 | 12.43 | 2.1x |
llama2-7b-chat | 16 | 13543 | 27.14 | / |
llama2-7b-chat | 4 | 5547 | 12.44 | 2.2x |
vicuna-13b | 16 | 25647 | 44.91 | / |
vicuna-13b | 4 | 9355 | 17.30 | 2.6x |
llama2-13b-chat | 16 | 25647 | 47.28 | / |
llama2-13b-chat | 4 | 9355 | 20.28 | 2.3x |

Model | AWQ 4-bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
---|---|---|---|
vicuna-7b | 8.61 | 19.09 | 2.2x |
llama2-7b-chat | 8.66 | 19.97 | 2.3x |
vicuna-13b | 12.17 | OOM | / |
llama2-13b-chat | 13.54 | OOM | / |

Model | AWQ 4-bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
---|---|---|---|
vicuna-7b | 65.34 | 93.12 | 1.4x |
llama2-7b-chat | 75.11 | 104.71 | 1.4x |
vicuna-13b | 115.40 | OOM | / |
llama2-13b-chat | 136.81 | OOM | / |
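Per-token latency can be converted to the perhaps more familiar tokens-per-second figure as 1000 / latency_ms. A small sketch applying this to the 4-bit column of the last table:

```python
# AWQ 4-bit latencies (ms/token) from the last table above.
latencies_ms = {
    "vicuna-7b": 65.34,
    "llama2-7b-chat": 75.11,
    "vicuna-13b": 115.40,
    "llama2-13b-chat": 136.81,
}

# Throughput in tokens/s is simply the reciprocal, scaled from ms to s.
for model, ms in latencies_ms.items():
    print(f"{model}: {1000 / ms:.1f} tokens/s")  # e.g. vicuna-7b: 15.3 tokens/s
```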