Author | Commit | Message | Date
Kuntai Du | 114d7270ff | [CI] Avoid naming different metrics with the same name in performance benchmark (#5615) | 2024-06-17 21:37:18 -07:00
Cyrus Leung | 32c86e494a | [Misc] Fix typo (#5618) | 2024-06-17 20:58:30 -07:00
youkaichao | 8eadcf0b90 | [misc][typo] fix typo (#5620) | 2024-06-17 20:54:57 -07:00
Joe Runde | 5002175e80 | [Kernel] Add punica dimensions for Granite 13b (#5559); Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> | 2024-06-18 03:54:11 +00:00
Isotr0py | daef218b55 | [Model] Initialize Phi-3-vision support (#4986) | 2024-06-17 19:34:33 -07:00
sroy745 | fa9e385229 | [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131) | 2024-06-17 21:29:09 -05:00
zifeitong | 26e1188e51 | [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606) | 2024-06-17 23:16:10 +00:00
Bruce Fontaine | a3e8a05d4c | [Bugfix] Fix KV head calculation for MPT models when using GQA (#5142) | 2024-06-17 15:26:41 -07:00
youkaichao | e441bad674 | [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584) | 2024-06-17 22:08:05 +00:00
youkaichao | 1b44aaf4e3 | [bugfix][distributed] fix 16 gpus local rank arrangement (#5604) | 2024-06-17 21:35:04 +00:00
Kuntai Du | 9e4e6fe207 | [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571) | 2024-06-17 11:41:08 -07:00
Jie Fu (傅杰) | ab66536dbf | [CI/BUILD] Support non-AVX512 vLLM building and testing (#5574) | 2024-06-17 14:36:10 -04:00
Kunshang Ji | 728c4c8a06 | [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814); Co-authored-by: Jiang Li <jiang1.li@intel.com>; Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>; Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> | 2024-06-17 11:01:25 -07:00
zhyncs | 1f12122b17 | [Misc] use AutoTokenizer for benchmark serving when vLLM not installed (#5588) | 2024-06-17 09:40:35 -07:00
Dipika Sikka | 890d8d960b | [Kernel] compressed-tensors marlin 24 support (#5435) | 2024-06-17 12:32:48 -04:00
Charles Riggins | 9e74d9d003 | Correct alignment in the seq_len diagram. (#5592); Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai> | 2024-06-17 12:05:33 -04:00
Amit Garg | 9333fb8eb9 | [Model] Rename Phi3 rope scaling type (#5595) | 2024-06-17 12:04:14 -04:00
Cody Yu | e2b85cf86a | Fix w8a8 benchmark and add Llama-3-8B (#5562) | 2024-06-17 06:48:06 +00:00
youkaichao | 845a3f26f9 | [Doc] add debugging tips for crash and multi-node debugging (#5581) | 2024-06-17 10:08:01 +08:00
youkaichao | f07d513320 | [build][misc] limit numpy version (#5582) | 2024-06-16 16:07:01 -07:00
Michael Goin | 4a6769053a | [CI][BugFix] Flip is_quant_method_supported condition (#5577) | 2024-06-16 14:07:34 +00:00
Antoni Baum | f31c1f90e3 | Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518) | 2024-06-16 07:48:02 +00:00
zifeitong | 3ce2c050dd | [Fix] Correct OpenAI batch response format (#5554) | 2024-06-15 16:57:54 -07:00
Nick Hill | 1c0afa13c5 | [BugFix] Don't start a Ray cluster when not using Ray (#5570) | 2024-06-15 16:30:51 -07:00
Alexander Matveev | d919ecc771 | add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145) | 2024-06-15 13:38:16 -04:00
SangBin Cho | e691918e3b | [misc] Do not allow to use lora with chunked prefill. (#5538); Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> | 2024-06-15 14:59:36 +00:00
Cyrus Leung | 81fbb3655f | [CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568) | 2024-06-15 07:29:42 -04:00
Cyrus Leung | 0e9164b40a | [mypy] Enable type checking for test directory (#5017) | 2024-06-15 04:45:31 +00:00
leiwen83 | 1b8a0d71cf | [Core][Bugfix]: fix prefix caching for blockv2 (#5364); Signed-off-by: Lei Wen <wenlei03@qiyi.com>; Co-authored-by: Lei Wen <wenlei03@qiyi.com> | 2024-06-14 17:23:56 -07:00
Simon Mo | bd7efe95d0 | Add ccache to amd (#5555) | 2024-06-14 17:18:22 -07:00
youkaichao | f5bb85b435 | [Core][Distributed] improve p2p cache generation (#5528) | 2024-06-14 14:47:45 -07:00
Woosuk Kwon | 28c145eb57 | [Bugfix] Fix typo in Pallas backend (#5558) | 2024-06-14 14:40:09 -07:00
Thomas Parnell | e2afb03c92 | [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460); Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> | 2024-06-14 20:28:11 +00:00
Sanger Steel | 6e2527a7cb | [Doc] Update documentation on Tensorizer (#5471) | 2024-06-14 11:27:57 -07:00
Simon Mo | cdab68dcdb | [Docs] Add ZhenFund as a Sponsor (#5548) | 2024-06-14 11:17:21 -07:00
youkaichao | d1c3d7d139 | [misc][distributed] fix benign error in is_in_the_same_node (#5512) | 2024-06-14 10:59:28 -07:00
Cyrus Leung | 77490c6f2f | [Core] Remove duplicate processing in async engine (#5525) | 2024-06-14 10:04:42 -07:00
youkaichao | 48f589e18b | [mis] fix flaky test of test_cuda_device_count_stateless (#5546) | 2024-06-14 10:02:23 -07:00
Tyler Michael Smith | 348616ac4b | [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) | 2024-06-14 10:02:00 -07:00
Robert Shaw | 15985680e2 | [ Misc ] Rs/compressed tensors cleanup (#5432); Co-authored-by: mgoin <michael@neuralmagic.com>; Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> | 2024-06-14 10:01:46 -07:00
Allen.Dou | d74674bbd9 | [Misc] Fix arg names (#5524) | 2024-06-14 09:47:44 -07:00
Tyler Michael Smith | 703475f6c2 | [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) | 2024-06-14 09:30:15 -07:00
Cyrus Leung | d47af2bc02 | [CI/Build] Disable LLaVA-NeXT CPU test (#5529) | 2024-06-14 09:27:30 -07:00
Kuntai Du | 319ad7f1d3 | [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label (#5073); Co-authored-by: simon-mo <simon.mo@hey.com> | 2024-06-13 22:36:20 -07:00
Simon Mo | 0f0d8bc065 | bump version to v0.5.0.post1 (#5522) | 2024-06-13 19:42:06 -07:00
Allen.Dou | 55d6361b13 | [Misc] Fix arg names in quantizer script (#5507) | 2024-06-13 19:02:53 -07:00
Jie Fu (傅杰) | cd9c0d65d9 | [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) | 2024-06-13 17:22:24 -06:00
Antoni Baum | 50eed24d25 | Add cuda_device_count_stateless (#5473) | 2024-06-13 16:06:49 -07:00
Tyler Michael Smith | e38042d4af | [Kernel] Disable CUTLASS kernels for fp8 (#5505) | 2024-06-13 13:38:05 -07:00
Tyler Michael Smith | 33e3b37242 | [CI/Build] Disable test_fp8.py (#5508) | 2024-06-13 13:37:48 -07:00