Author | Commit | Message | Date
Dipika Sikka | 890d8d960b | [Kernel] compressed-tensors marlin 24 support (#5435) | 2024-06-17 12:32:48 -04:00
Charles Riggins | 9e74d9d003 | Correct alignment in the seq_len diagram. (#5592) | 2024-06-17 12:05:33 -04:00
    Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai>
Amit Garg | 9333fb8eb9 | [Model] Rename Phi3 rope scaling type (#5595) | 2024-06-17 12:04:14 -04:00
Cody Yu | e2b85cf86a | Fix w8a8 benchmark and add Llama-3-8B (#5562) | 2024-06-17 06:48:06 +00:00
youkaichao | 845a3f26f9 | [Doc] add debugging tips for crash and multi-node debugging (#5581) | 2024-06-17 10:08:01 +08:00
youkaichao | f07d513320 | [build][misc] limit numpy version (#5582) | 2024-06-16 16:07:01 -07:00
Michael Goin | 4a6769053a | [CI][BugFix] Flip is_quant_method_supported condition (#5577) | 2024-06-16 14:07:34 +00:00
Antoni Baum | f31c1f90e3 | Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518) | 2024-06-16 07:48:02 +00:00
zifeitong | 3ce2c050dd | [Fix] Correct OpenAI batch response format (#5554) | 2024-06-15 16:57:54 -07:00
Nick Hill | 1c0afa13c5 | [BugFix] Don't start a Ray cluster when not using Ray (#5570) | 2024-06-15 16:30:51 -07:00
Alexander Matveev | d919ecc771 | add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145) | 2024-06-15 13:38:16 -04:00
SangBin Cho | e691918e3b | [misc] Do not allow to use lora with chunked prefill. (#5538) | 2024-06-15 14:59:36 +00:00
    Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Cyrus Leung | 81fbb3655f | [CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568) | 2024-06-15 07:29:42 -04:00
Cyrus Leung | 0e9164b40a | [mypy] Enable type checking for test directory (#5017) | 2024-06-15 04:45:31 +00:00
leiwen83 | 1b8a0d71cf | [Core][Bugfix]: fix prefix caching for blockv2 (#5364) | 2024-06-14 17:23:56 -07:00
    Signed-off-by: Lei Wen <wenlei03@qiyi.com>
    Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Simon Mo | bd7efe95d0 | Add ccache to amd (#5555) | 2024-06-14 17:18:22 -07:00
youkaichao | f5bb85b435 | [Core][Distributed] improve p2p cache generation (#5528) | 2024-06-14 14:47:45 -07:00
Woosuk Kwon | 28c145eb57 | [Bugfix] Fix typo in Pallas backend (#5558) | 2024-06-14 14:40:09 -07:00
Thomas Parnell | e2afb03c92 | [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) | 2024-06-14 20:28:11 +00:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Sanger Steel | 6e2527a7cb | [Doc] Update documentation on Tensorizer (#5471) | 2024-06-14 11:27:57 -07:00
Simon Mo | cdab68dcdb | [Docs] Add ZhenFund as a Sponsor (#5548) | 2024-06-14 11:17:21 -07:00
youkaichao | d1c3d7d139 | [misc][distributed] fix benign error in is_in_the_same_node (#5512) | 2024-06-14 10:59:28 -07:00
Cyrus Leung | 77490c6f2f | [Core] Remove duplicate processing in async engine (#5525) | 2024-06-14 10:04:42 -07:00
youkaichao | 48f589e18b | [mis] fix flaky test of test_cuda_device_count_stateless (#5546) | 2024-06-14 10:02:23 -07:00
Tyler Michael Smith | 348616ac4b | [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) | 2024-06-14 10:02:00 -07:00
Robert Shaw | 15985680e2 | [ Misc ] Rs/compressed tensors cleanup (#5432) | 2024-06-14 10:01:46 -07:00
    Co-authored-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Allen.Dou | d74674bbd9 | [Misc] Fix arg names (#5524) | 2024-06-14 09:47:44 -07:00
Tyler Michael Smith | 703475f6c2 | [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) | 2024-06-14 09:30:15 -07:00
Cyrus Leung | d47af2bc02 | [CI/Build] Disable LLaVA-NeXT CPU test (#5529) | 2024-06-14 09:27:30 -07:00
Kuntai Du | 319ad7f1d3 | [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label (#5073) | 2024-06-13 22:36:20 -07:00
    Co-authored-by: simon-mo <simon.mo@hey.com>
Simon Mo | 0f0d8bc065 | bump version to v0.5.0.post1 (#5522) | 2024-06-13 19:42:06 -07:00
Allen.Dou | 55d6361b13 | [Misc] Fix arg names in quantizer script (#5507) | 2024-06-13 19:02:53 -07:00
Jie Fu (傅杰) | cd9c0d65d9 | [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) | 2024-06-13 17:22:24 -06:00
Antoni Baum | 50eed24d25 | Add cuda_device_count_stateless (#5473) | 2024-06-13 16:06:49 -07:00
Tyler Michael Smith | e38042d4af | [Kernel] Disable CUTLASS kernels for fp8 (#5505) | 2024-06-13 13:38:05 -07:00
Tyler Michael Smith | 33e3b37242 | [CI/Build] Disable test_fp8.py (#5508) | 2024-06-13 13:37:48 -07:00
youkaichao | 1696efe6c9 | [misc] fix format.sh (#5511) | 2024-06-13 12:09:16 -07:00
Antoni Baum | 6b0511a57b | Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478) | 2024-06-13 11:22:50 -07:00
Antoni Baum | a8fda4f661 | Seperate dev requirements into lint and test (#5474) | 2024-06-13 11:22:41 -07:00
Cody Yu | 30299a41fa | [MISC] Remove FP8 warning (#5472) | 2024-06-13 11:22:30 -07:00
    Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Tyler Michael Smith | 85657b5607 | [Kernel] Factor out epilogues from cutlass kernels (#5391) | 2024-06-13 11:22:19 -07:00
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
    Co-authored-by: youkaichao <youkaichao@gmail.com>
    Co-authored-by: zifeitong <zifei.tong@parasail.io>
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Cyrus Leung | 0ce7b952f8 | [Doc] Update LLaVA docs (#5437) | 2024-06-13 11:22:07 -07:00
    Co-authored-by: Roger Wang <ywang@roblox.com>
Cyrus Leung | 39873476f8 | [CI/Build] Simplify OpenAI server setup in tests (#5100) | 2024-06-13 11:21:53 -07:00
Cyrus Leung | 03dccc886e | [Misc] Add vLLM version getter to utils (#5098) | 2024-06-13 11:21:39 -07:00
Woosuk Kwon | a65634d3ae | [Docs] Add 4th meetup slides (#5509) | 2024-06-13 10:18:26 -07:00
Li, Jiang | 80aa7e91fc | [Hardware][Intel] Optimize CPU backend and add more performance tips (#4971) | 2024-06-13 09:33:14 -07:00
    Co-authored-by: Jianan Gu <jianan.gu@intel.com>
wenyujin333 | bd43973522 | [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) | 2024-06-13 09:01:10 -07:00
    Tune Qwen2-57B-A14B configs based on #4921.
    Throughput benchmark command (A100 GPU): python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2
    tp=2: no config 10.53 requests/s, 11058.17 tokens/s; w/ PR 12.47 requests/s, 13088.57 tokens/s
    tp=4: no config 17.77 requests/s, 18662.95 tokens/s; w/ PR 20.20 requests/s, 21212.32 tokens/s
Michael Goin | 23ec72fa03 | [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) | 2024-06-13 15:18:08 +00:00
Dipika Sikka | c2637a613b | [Kernel] w4a16 support for compressed-tensors (#5385) | 2024-06-13 10:19:56 -04:00
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Wang, Yi | 88407532e7 | [Bugfix]if the content is started with ":"(response of ping), client should i… (#5303) | 2024-06-12 20:16:41 -07:00
    Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>