vllm/vllm/model_executor/parallel_utils
Latest commit 7076fa1c9f by Zhuohan Li:
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can now be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- The model-loading code is much simpler.
- Tensor parallelism is supported for all MQA/GQA models even when the number of key/value heads is smaller than the tensor-parallel size (see the usage sketch below).
Committed 2023-11-15 22:50:41 -08:00
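As a rough usage sketch of the first and last bullets, quantized loading and tensor parallelism are exposed through the `LLM` constructor's `quantization` and `tensor_parallel_size` arguments; the checkpoint name below is only an example, and any AWQ checkpoint with a compatible config is assumed to work after this refactor.

```python
from vllm import LLM, SamplingParams

# Sketch: load an AWQ-quantized checkpoint and shard it across 2 GPUs.
# The checkpoint name is an example; substitute any AWQ model.
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",  # example AWQ checkpoint
    quantization="awq",                # or "squeezellm"
    tensor_parallel_size=2,            # works even if num_kv_heads < 2
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```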
| File | Last commit | Date |
| --- | --- | --- |
| __init__.py | TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) | 2023-10-02 15:36:09 -07:00 |
| communication_op.py | Implement prompt logprobs & Batched topk for computing logprobs (#1328) | 2023-10-16 10:56:50 -07:00 |
| parallel_state.py | TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) | 2023-10-02 15:36:09 -07:00 |
| README.md | Change the name to vLLM (#150) | 2023-06-17 03:07:40 -07:00 |
| utils.py | TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622) | 2023-11-15 22:50:41 -08:00 |

The files in this folder are ported from Megatron-LM. We keep only the code that is used in inference.
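For orientation, here is a minimal sketch of how these ported utilities are typically used inside a tensor-parallel layer, assuming the helpers exported by parallel_state.py (`get_tensor_model_parallel_world_size`) and communication_op.py (`tensor_model_parallel_all_reduce`) at this revision; the `row_parallel_matmul` function itself is a hypothetical illustration, not part of this folder.

```python
import torch

from vllm.model_executor.parallel_utils.communication_op import (
    tensor_model_parallel_all_reduce)
from vllm.model_executor.parallel_utils.parallel_state import (
    get_tensor_model_parallel_world_size)


def row_parallel_matmul(x: torch.Tensor, weight_shard: torch.Tensor) -> torch.Tensor:
    """Row-parallel matmul: each rank holds a shard of the weight.

    x:            [batch, hidden / tp_size] activation shard on this rank
    weight_shard: [hidden / tp_size, out]   weight shard on this rank
    """
    # Each rank computes a partial product over its slice of the hidden dim.
    partial = torch.matmul(x, weight_shard)
    if get_tensor_model_parallel_world_size() > 1:
        # Sum the per-rank partial products to recover the full matmul.
        partial = tensor_model_parallel_all_reduce(partial)
    return partial
```

This assumes the tensor-parallel process groups have already been initialized (via parallel_state.py) before the layer runs.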