Cody Yu
|
f7dac83d95
|
[Kernel] Raise an exception in MoE kernel if the batch size is larger than 65k (#5939)
|
2024-06-29 21:04:20 +08:00 |
|
Antoni Baum
|
7c01f70641
|
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum (#5974)
|
2024-06-29 12:47:53 +00:00 |
|
Cyrus Leung
|
51e971d39e
|
[Bugfix] Support eos_token_id from config.json (#5954)
|
2024-06-29 11:19:02 +00:00 |
|
Roger Wang
|
329df38f1a
|
[Misc] Update Phi-3-Vision Example (#5981)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-06-29 14:34:29 +08:00 |
|
Woosuk Kwon
|
580353da93
|
[Bugfix] Fix precisions in Gemma 1 (#5913)
|
2024-06-29 03:10:21 +00:00 |
|
Joe Runde
|
ba4994443a
|
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
Signed-off-by: Joe Runde <joe@joerun.de>
|
2024-06-29 10:48:25 +08:00 |
|
William Lin
|
906a19cdb0
|
[Misc] Extend vLLM Metrics logging API (#5925)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
|
2024-06-29 10:36:06 +08:00 |
|
mcalman
|
c4bca740e8
|
[Bugfix] fix missing last itl in openai completions benchmark (#5926)
|
2024-06-29 10:34:42 +08:00 |
|
Woosuk Kwon
|
7f83f40dee
|
[Bugfix][TPU] Fix pad slot id (#5977)
|
2024-06-28 18:55:17 -07:00 |
|
Woosuk Kwon
|
54814fd85b
|
[Bugfix][TPU] Fix TPU sampler output (#5978)
|
2024-06-28 18:14:16 -07:00 |
|
Lily Liu
|
7041de4384
|
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>
|
2024-06-28 15:28:49 -07:00 |
|
Robert Shaw
|
6a62cb82cc
|
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (#5963)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
|
2024-06-28 17:46:30 -04:00 |
|
Tyler Michael Smith
|
5d2a1a9cf0
|
Unmark more files as executable (#5962)
|
2024-06-28 17:34:56 -04:00 |
|
Michael Goin
|
4bf35ed9ae
|
[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled (#5936)
|
2024-06-28 21:12:40 +00:00 |
|
wangding zeng
|
be0b3af9e0
|
Support Deepseek-V2 (#4650)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
|
2024-06-28 13:24:57 -07:00 |
|
Robert Shaw
|
2cd402e169
|
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
|
2024-06-28 18:43:49 +00:00 |
|
Robert Shaw
|
b185230744
|
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) (#5928)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
|
2024-06-28 13:49:57 -04:00 |
|
Tyler Michael Smith
|
6a2d659d28
|
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
|
2024-06-28 17:10:34 +00:00 |
|
Cody Yu
|
b2c620230a
|
[Spec Decode] Introduce DraftModelRunner (#5799)
|
2024-06-28 09:17:51 -07:00 |
|
xwjiang2010
|
b90d8cd832
|
[Distributed] Make it clear that % should not be in tensor dict keys. (#5927)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
|
2024-06-28 15:20:22 +00:00 |
|
Cyrus Leung
|
3b752a6555
|
[CI/Build] [2/3] Reorganize entrypoints tests (#5904)
|
2024-06-28 07:59:18 -07:00 |
|
Thomas Parnell
|
ec1ad0046c
|
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (#5894)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
|
2024-06-28 07:42:17 -07:00 |
|
Ilya Lavrenov
|
57f09a419c
|
[Hardware][Intel] OpenVINO vLLM backend (#5379)
|
2024-06-28 13:50:16 +00:00 |
|
Tyler Michael Smith
|
5932634409
|
Unmark fused_moe config json file as executable (#5960)
|
2024-06-28 06:36:12 -07:00 |
|
Cyrus Leung
|
5cbe8d155c
|
[Core] Registry for processing model inputs (#5214)
Co-authored-by: ywang96 <ywang@roblox.com>
|
2024-06-28 12:09:56 +00:00 |
|
Isotr0py
|
0d0e3a42ac
|
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (#5956)
|
2024-06-28 12:03:41 +00:00 |
|
xwjiang2010
|
74d55c065b
|
[VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. (#5905)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-06-28 07:29:13 +00:00 |
|
Woosuk Kwon
|
f136da15e1
|
[Hardware][TPU] Optimize KV cache swapping (#5878)
|
2024-06-27 21:12:13 -07:00 |
|
Divakar Verma
|
c3dde367f1
|
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932)
|
2024-06-27 13:41:08 -07:00 |
|
youkaichao
|
64e8d2a783
|
[core][misc] remove logical block (#5882)
|
2024-06-27 13:34:55 -07:00 |
|
Woosuk Kwon
|
79c92c7c8a
|
[Model] Add Gemma 2 (#5908)
|
2024-06-27 13:33:56 -07:00 |
|
Roger Wang
|
736ed38849
|
[CI/Build] Fix Args for _get_logits_warper in Sampler Test (#5922)
|
2024-06-27 11:43:04 -07:00 |
|
Nick Hill
|
365791ff81
|
[BugFix] Fix min_tokens behaviour for multiple eos tokens (#5849)
|
2024-06-27 11:31:11 -07:00 |
|
Nick Hill
|
691e29ecf3
|
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens (#5876)
|
2024-06-27 10:59:33 -07:00 |
|
youkaichao
|
3fd02bda51
|
[doc][misc] add note for Kubernetes users (#5916)
|
2024-06-27 10:07:07 -07:00 |
|
Cyrus Leung
|
98cf2ed678
|
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896)
|
2024-06-27 09:08:10 -07:00 |
|
Cyrus Leung
|
e9d32d077d
|
[CI/Build] [1/3] Reorganize entrypoints tests (#5526)
|
2024-06-27 12:43:17 +00:00 |
|
Roger Wang
|
2061f0b8a7
|
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888)
|
2024-06-27 08:29:24 +00:00 |
|
Cyrus Leung
|
96354d6a29
|
[Model] Add base class for LoRA-supported models (#5018)
|
2024-06-27 16:03:04 +08:00 |
|
xwjiang2010
|
d12af207d2
|
[VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly (#5880)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
|
2024-06-27 15:15:24 +08:00 |
|
Cyrus Leung
|
6eabc6cb0e
|
[Doc] Add note about context length in Phi-3-Vision example (#5887)
|
2024-06-26 23:20:01 -07:00 |
|
Nick Hill
|
2110557dab
|
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com>
|
2024-06-27 04:12:10 +00:00 |
|
Roger Wang
|
b9e84259e9
|
[Misc] Add example for LLaVA-NeXT (#5879)
|
2024-06-26 17:57:16 -07:00 |
|
youkaichao
|
294104c3f9
|
[doc] update usage of env var to avoid conflict (#5873)
|
2024-06-26 17:57:12 -04:00 |
|
Chip Kerchner
|
38a1674abb
|
Support CPU inference with VSX PowerPC ISA (#5652)
|
2024-06-26 21:53:04 +00:00 |
|
Woosuk Kwon
|
f5c8628fdc
|
[Bugfix][TPU] Fix CPU cache allocation (#5869)
|
2024-06-26 13:42:40 -07:00 |
|
Woosuk Kwon
|
cbc53b6b8d
|
[Hardware][TPU] Support parallel sampling & Swapping (#5855)
|
2024-06-26 11:07:49 -07:00 |
|
sasha0552
|
c54269d967
|
[Frontend] Add tokenize/detokenize endpoints (#5054)
|
2024-06-26 16:54:22 +00:00 |
|
Luka Govedič
|
5bfd1bbc98
|
[Kernel] Adding bias epilogue support for cutlass_scaled_mm (#5560)
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2024-06-26 15:16:00 +00:00 |
|
Cyrus Leung
|
6984c02a27
|
[CI/Build] Refactor image test assets (#5821)
|
2024-06-26 01:02:34 -07:00 |
|