Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cyrus Leung
39873476f8
[CI/Build] Simplify OpenAI server setup in tests ( #5100 )
2024-06-13 11:21:53 -07:00
Michael Goin
23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations ( #5466 )
2024-06-13 15:18:08 +00:00
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
youkaichao
ea3890a5f0
[Core][Distributed] code deduplication in tp&pp with coordinator( #5293 )
...
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293 )
2024-06-12 17:27:08 -07:00
Travis Johnson
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146 , this PR improves FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large).
In details, we applied 3 optimizations:
- Use inverted scale so that most divisions are changed to multiplications.
- Unroll the loop by 4 times to improve ILP.
- Use vectorized 4 to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
SangBin Cho
847cdcca1c
[CI] Upgrade codespell version. ( #5381 )
2024-06-12 10:06:14 -07:00
Simon Mo
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
Michael Goin
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case ( #5230 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
youkaichao
c4bd03c7c5
[Core][Distributed] add same-node detection ( #5369 )
2024-06-11 10:53:59 -07:00
sasha0552
dcbf4286af
[Frontend] Customizable RoPE theta ( #5197 )
2024-06-11 10:42:26 -07:00
Cyrus Leung
640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" ( #5026 )
2024-06-10 22:36:46 -07:00
maor-ps
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs ( #5312 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-11 10:30:31 +08:00
Itay Etelis
774d1035e4
[Feature][Frontend]: Continued stream_options implementation also in CompletionRequest ( #5319 )
2024-06-10 14:22:09 +00:00
Cyrus Leung
6b29d6fe70
[Model] Initial support for LLaVA-NeXT ( #4199 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-10 12:47:15 +00:00
Cyrus Leung
0bfa1c4f13
[Misc] Improve error message when LoRA parsing fails ( #5194 )
2024-06-10 19:38:49 +08:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
youkaichao
5d7e3d0176
[mis][ci/test] fix flaky test in test_sharded_state_loader.py ( #5361 )
...
[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361 )
2024-06-09 03:50:14 +00:00
youkaichao
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357 )
2024-06-08 08:59:20 +00:00
youkaichao
9fb900f90c
[CI/Test] improve robustness of test (hf_runner) ( #5347 )
...
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347 )
2024-06-07 22:31:32 -07:00
Roger Wang
7a9cb294ae
[Frontend] Add OpenAI Vision API Support ( #5237 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-07 11:23:32 -07:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
youkaichao
388596c914
[Misc][Utils] allow get_open_port to be called for multiple times ( #5333 )
2024-06-06 22:15:11 -07:00
Itay Etelis
baa15a9ec3
[Feature][Frontend]: Add support for stream_options in ChatCompletionRequest ( #5135 )
2024-06-07 03:29:24 +00:00
Antoni Baum
ccdc490dda
[Core] Change LoRA embedding sharding to support loading methods ( #5038 )
2024-06-06 19:07:57 -07:00
Matthew Goldey
828da0d44e
[Frontend] enable passing multiple LoRA adapters at once to generate() ( #5300 )
2024-06-06 15:48:13 -05:00
liuyhwangyh
4efff036f0
Bugfix: fix broken of download models from modelscope ( #5233 )
...
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
2024-06-06 09:28:10 -07:00
Cyrus Leung
89c920785f
[CI/Build] Update vision tests ( #5307 )
2024-06-06 05:17:18 -05:00
Breno Faria
7b0a0dfb22
[Frontend][Core] Update Outlines Integration from FSM to Guide ( #4109 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-06-05 16:49:12 -07:00
Nick Hill
faf71bcd4b
[Speculative Decoding] Add ProposerWorkerBase abstract class ( #5252 )
2024-06-05 14:53:05 -07:00
Woosuk Kwon
41ca62cf03
[Misc] Add CustomOp interface for device portability ( #5255 )
2024-06-05 09:18:19 -07:00
zifeitong
974fc9b845
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True ( #5226 )
2024-06-04 19:37:28 -07:00
Cyrus Leung
9ba093b4f4
[CI/Build] Simplify model loading for HfRunner ( #5251 )
2024-06-04 10:09:19 -07:00
Cyrus Leung
ec784b2526
[CI/Build] Add inputs tests ( #5215 )
2024-06-03 21:01:46 -07:00
afeldman-nm
f42a006b15
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend ( #5210 )
2024-06-03 20:32:57 -07:00
Toshiki Kataoka
06b2550cbb
[Bugfix] Support prompt_logprobs==0 ( #5217 )
2024-06-03 17:59:30 -07:00
Breno Faria
f775a07e30
[FRONTEND] OpenAI tools support named functions ( #5032 )
2024-06-03 18:25:29 -05:00
Kaiyang Chen
10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 ( #3834 )
2024-06-03 13:37:11 -07:00
Yuan
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
Tyler Michael Smith
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
Cyrus Leung
7a64d24aad
[Core] Support image processor ( #4197 )
2024-06-02 22:56:41 -07:00
Cyrus Leung
dfbe60dc62
[Misc] Simplify code and fix type annotations in conftest.py ( #5118 )
2024-06-02 16:05:50 -07:00
Simon Mo
ed59a7ed23
Update test_ignore_eos ( #4898 )
2024-06-02 02:21:53 +00:00
chenqianfzh
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
Varun Sundar Rabindranath
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Tyler Michael Smith
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
SnowDist
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00