Refactor the tensor parallelism, quantization, and weight-loading code. Summary of the new features enabled by this PR:

- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580) (see the quantization sketch below).
- The model-loading code is much simpler.
- Model parallelism is supported for all MQA/GQA models, even when the number of key/value heads is smaller than the tensor parallel size (see the replication sketch below).
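A hedged usage sketch (not part of this PR's diff) showing how a quantized checkpoint is loaded through vLLM's Python API; the model ID below is illustrative, substitute any AWQ-quantized checkpoint:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # hypothetical example checkpoint
    quantization="awq",               # or "squeezellm"
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```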
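A minimal sketch, under assumptions about how the refactor handles MQA/GQA: when the number of key/value heads is smaller than the tensor parallel size, the KV heads are replicated across ranks so that every rank still owns at least one. The function name and return convention below are illustrative, not taken from the vLLM source:

```python
def kv_heads_per_rank(num_kv_heads: int, tp_size: int) -> tuple[int, int]:
    """Return (KV heads owned by each rank, replicas of each KV head)."""
    if num_kv_heads >= tp_size:
        # Normal case: shard the KV heads evenly across ranks.
        assert num_kv_heads % tp_size == 0
        return num_kv_heads // tp_size, 1
    # MQA/GQA case: each KV head is replicated across a group of ranks.
    assert tp_size % num_kv_heads == 0
    return 1, tp_size // num_kv_heads

# Example: 8 KV heads (GQA) on 16-way tensor parallelism ->
# each rank holds 1 KV head, and each head lives on 2 ranks.
print(kv_heads_per_rank(8, 16))  # (1, 2)
```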
The files in this folder are ported from Megatron-LM. We only keep the code that is used in inference. The folder currently contains:

- `__init__.py`
- `communication_op.py`
- `parallel_state.py`
- `README.md`
- `utils.py`
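A minimal sketch (not the actual vLLM or Megatron-LM code) of the kind of inference-time collective these utilities wrap: after a row-parallel linear layer, each tensor-parallel rank holds a partial sum of the output, and an all-reduce combines them.

```python
import torch
import torch.distributed as dist

def row_parallel_linear_forward(x_shard: torch.Tensor,
                                weight_shard: torch.Tensor) -> torch.Tensor:
    # x_shard: rank-local slice of the input features.
    # weight_shard: rank-local rows of the weight, sharded along the input dim.
    partial = x_shard @ weight_shard.t()
    # Sum the partial outputs across the tensor-parallel group.
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```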