From 4cefa9b49b6cb2be6d7eac88315df65e0f0d8c9a Mon Sep 17 00:00:00 2001
From: Simon Mo
Date: Sat, 2 Dec 2023 15:52:47 -0800
Subject: [PATCH] [Docs] Update the AWQ documentation to highlight performance
 issue (#1883)

---
 docs/source/quantization/auto_awq.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/source/quantization/auto_awq.rst b/docs/source/quantization/auto_awq.rst
index 0a2b4423..bbbb9aee 100644
--- a/docs/source/quantization/auto_awq.rst
+++ b/docs/source/quantization/auto_awq.rst
@@ -3,6 +3,12 @@
 AutoAWQ
 ==================
 
+.. warning::
+
+   Please note that AWQ support in vLLM is currently under-optimized, and we recommend using the unquantized version of the model
+   for better accuracy and higher throughput. For now, AWQ is best used to reduce the model's memory footprint, and it is better suited
+   to low-latency inference with a small number of concurrent requests; vLLM's AWQ implementation has lower throughput than the unquantized model.
+
 To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
 Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
 The main benefits are lower latency and memory usage.
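Below is a minimal sketch of the quantize-then-serve workflow the page above documents, assuming AutoAWQ's ``AutoAWQForCausalLM`` API and vLLM's ``quantization="AWQ"`` option; the model name, output path, and ``quant_config`` values are illustrative placeholders rather than part of this patch.

.. code-block:: python

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "lmsys/vicuna-7b-v1.5"   # FP16 model to quantize (placeholder)
    quant_path = "vicuna-7b-v1.5-awq"     # where the 4-bit weights are written
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the FP16 model and its tokenizer.
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Quantize to 4-bit AWQ and save the quantized checkpoint.
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

    # Serve the quantized checkpoint with vLLM. This lowers the memory footprint,
    # but (per the warning above) throughput is lower than the unquantized model.
    from vllm import LLM

    llm = LLM(model=quant_path, quantization="AWQ")
    print(llm.generate("What is AWQ quantization?"))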