Lily Liu
d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI ( #6351 )
2024-07-12 01:32:06 +00:00
Robert Shaw
7ed6a4f0e1
[ BugFix ] Prompt Logprobs Detokenization ( #6223 )
...
Co-authored-by: Zifei Tong <zifeitong@gmail.com>
2024-07-11 22:02:29 +00:00
Kuntai Du
a4feba929b
[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy ( #5362 )
2024-07-11 13:28:38 -07:00
youkaichao
2d23b42d92
[doc] update pipeline parallel in readme ( #6347 )
2024-07-11 11:38:40 -07:00
xwjiang2010
1df43de9bb
[bug fix] Fix llava next feature size calculation. ( #6339 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-07-11 17:21:10 +00:00
Simon Mo
52b7fcb35a
Benchmark: add H100 suite ( #6047 )
2024-07-11 09:17:07 -07:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities ( #6082 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
Mor Zusman
55f692b46e
[BugFix] get_and_reset only when scheduler outputs are not empty ( #6266 )
2024-07-11 07:40:20 -07:00
Thomas Parnell
8a1415cf77
[Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. ( #6326 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-11 07:05:59 -07:00
pushan
546b101fa0
[BugFix]: fix engine timeout due to request abort ( #6255 )
...
Signed-off-by: yatta zhang <ytzhang01@foxmail.com>
Signed-off-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com>
Co-authored-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com>
2024-07-11 06:46:31 -07:00
aniaan
3963a5335b
[Misc] refactor(config): clean up unused code ( #6320 )
2024-07-11 09:39:07 +00:00
Roger Wang
c4774eb841
[Bugfix] Fix snapshot download in serving benchmark ( #6318 )
2024-07-11 07:04:05 +00:00
Lim Xiang Yang
fc17110bbe
[BugFix]: set outlines pkg version ( #6262 )
2024-07-11 04:37:11 +00:00
Jie Fu (傅杰)
439c84581a
[Doc] Update description of vLLM support for CPUs ( #6003 )
2024-07-10 21:15:29 -07:00
daquexian
99ded1e1c4
[Doc] Remove comments incorrectly copied from another project ( #6286 )
2024-07-10 17:05:26 -07:00
Woosuk Kwon
997df46a32
[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor ( #6313 )
2024-07-10 16:39:02 -07:00
sroy745
ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models ( #5765 )
2024-07-10 16:02:47 -07:00
sangjune.park
44cc76610d
[Bugfix] Fix OpenVINOExecutor abstractmethod error ( #6296 )
...
Signed-off-by: sangjune.park <sangjune.park@navercorp.com>
2024-07-10 10:03:32 -07:00
Benjamin Muskalla
b422d4961a
[CI/Build] Enable mypy typing for remaining folders ( #6268 )
2024-07-10 22:15:55 +08:00
Thomas Parnell
c38eba3046
[Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. ( #6303 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-10 09:04:07 -04:00
Woosuk Kwon
e72ae80b06
[Bugfix] Support 2D input shape in MoE layer ( #6287 )
2024-07-10 09:03:16 -04:00
Cyrus Leung
8a924d2248
[Doc] Guide for adding multi-modal plugins ( #6205 )
2024-07-10 14:55:34 +08:00
Woosuk Kwon
5ed3505d82
[Bugfix][TPU] Add prompt adapter methods to TPUExecutor ( #6279 )
2024-07-09 19:30:56 -07:00
youkaichao
da78caecfa
[core][distributed] zmq fallback for broadcasting large objects ( #6183 )
...
[core][distributed] add zmq fallback for broadcasting large objects (#6183 )
2024-07-09 18:49:11 -07:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
Baoyuan Qi
d3a245138a
[Bugfix]fix and needs_scalar_to_array logic check ( #6238 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-07-09 23:43:24 +00:00
Murali Andoorveedu
673dd4cae9
[Docs] Docs update for Pipeline Parallel ( #6222 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-09 16:24:58 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
Kevin H. Luu
a0550cbc80
Add support for multi-node on CI ( #5955 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-07-09 12:56:56 -07:00
Woosuk Kwon
08c5bdecae
[Bugfix][TPU] Fix outlines installation in TPU Dockerfile ( #6256 )
2024-07-09 02:56:06 -07:00
Woosuk Kwon
5d5b4c5fe5
[Bugfix][TPU] Add missing None to model input ( #6245 )
2024-07-09 00:21:37 -07:00
youkaichao
70c232f85a
[core][distributed] fix ray worker rank assignment ( #6235 )
2024-07-08 21:31:44 -07:00
youkaichao
a3c9435d93
[hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability ( #6216 )
2024-07-08 20:02:15 -07:00
Simon Mo
4f0e0ea131
Add FlashInfer to default Dockerfile ( #6172 )
2024-07-08 13:38:03 -07:00
tomeras91
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding ( #6214 )
2024-07-08 11:25:51 -07:00
Eric
185ad31f37
[Bugfix] use diskcache in outlines _get_guide #5436 ( #6203 )
2024-07-08 11:23:24 -07:00
afeldman-nm
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) ( #4888 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
Avshalom Manevich
f7a8fa39d8
[Kernel] reloading fused_moe config on the last chunk ( #6210 )
2024-07-08 08:00:38 -07:00
Haichuan
717f4bcea0
Feature/add benchmark testing ( #5947 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-08 07:52:06 +00:00
kczimm
16620f439d
do not exclude object field in CompletionStreamResponse ( #6196 )
2024-07-08 10:32:57 +08:00
youkaichao
3b08fe2b13
[misc][frontend] log all available endpoints ( #6195 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-07-07 15:11:12 -07:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Haichuan
333306a252
add benchmark for fix length input and output ( #5857 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-07 07:42:13 +00:00
Roger Wang
6206dcb29e
[Model] Add PaliGemma ( #5189 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
Cyrus Leung
9389380015
[Doc] Move guide for multimodal model and other improvements ( #6168 )
2024-07-06 17:18:59 +08:00
Roger Wang
175c43eca4
[Doc] Reorganize Supported Models by Type ( #6167 )
2024-07-06 05:59:36 +00:00
Simon Mo
bc96d5c330
Move release wheel env var to Dockerfile instead ( #6163 )
2024-07-05 17:19:53 -07:00
Simon Mo
f0250620dd
Fix release wheel build env var ( #6162 )
2024-07-05 16:24:31 -07:00
Simon Mo
2de490d60f
Update wheel builds to strip debug ( #6161 )
2024-07-05 14:51:25 -07:00
Simon Mo
79d406e918
[Docs] Fix readthedocs for tag build ( #6158 )
2024-07-05 12:44:40 -07:00