3544 Commits

Author SHA1 Message Date
Zhonghua Deng
d345f409b7
[V1] EngineCore supports profiling (#10564)
Signed-off-by: Abatom <abzhonghua@gmail.com>
2024-11-22 17:16:15 -08:00
Russell Bryant
28598f3939
[Core] remove temporary local variables in LLMEngine.__init__ (#10577)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-22 16:22:53 -08:00
zixuanzhang226
948c859571
support bitsandbytes quantization with qwen model (#10549)
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com>
2024-11-22 16:16:14 -08:00
Ricky Xu
97814fbf0f
[v1] Refactor KVCacheManager for more hash input than token ids (#10507)
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-22 23:27:25 +00:00
youkaichao
eebad39f26
[torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
youkaichao
db100c5cde
[bugfix] fix full graph tests (#10581)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 10:02:14 -08:00
Noam Gat
11fcf0e066
Remove token-adding chat embedding params (#10551)
Signed-off-by: Noam Gat <noamgat@gmail.com>
2024-11-21 23:59:47 -08:00
Isotr0py
b6374e09b0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-22 15:01:56 +08:00
youkaichao
a111d0151f
[platforms] absorb worker cls difference into platforms folder (#10555)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-21 21:00:32 -08:00
Woosuk Kwon
446c7806b2
[Minor] Fix line-too-long (#10563)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 19:40:40 -08:00
youkaichao
33e0a2540a
[9/N] torch.compile LLM usage (#10552)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 19:13:31 -08:00
Simon Mo
aed074860a
[Benchmark] Add new H100 machine (#10547)
2024-11-21 18:27:20 -08:00
Michael Goin
9afa014552
Add small example to metrics.rst (#10550)
2024-11-21 23:43:43 +00:00
Woosuk Kwon
46fe9b46d8
[Minor] Revert change in offline inference example (#10545)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 21:28:16 +00:00
youkaichao
cf656f5a02
[misc] improve error message (#10553)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 13:13:17 -08:00
Yunmeng
edec3385b6
[CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535)
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-11-21 13:03:58 -08:00
Woosuk Kwon
f9310cbd0c
[V1] Fix Compilation config & Enable CUDA graph by default (#10528)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 12:53:39 -08:00
youkaichao
7560ae5caf
[8/N] enable cli flag without a space (#10529)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 12:30:42 -08:00
Cyrus Leung
e7a8341c7c
[Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-21 18:09:43 +00:00
Roger Wang
c51e397fe8
[Misc] Suppress duplicated logging regarding multimodal input pipeline (#10530)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-21 09:21:31 -08:00
Jee Jee Li
2385b60d83
[Kernel] Register punica ops directly (#10522)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-21 09:18:11 -08:00
Chauncey
da7e702c6f
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored (#10180)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-21 16:24:32 +00:00
Xiaoyu Zhang
4d676f0852
[Bugfix] Embedding model pooling_type equals ALL and multi input's bug (#10494)
2024-11-21 14:40:02 +00:00
Isotr0py
d5ec121f95
[Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models (#10518)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-21 14:20:08 +00:00
Wang, Yi
8a93a598d9
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (#10524)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-11-21 11:15:36 +00:00
Alex Brooks
1cfde82ffd
[Model] Add Support for Multimodal Granite Models (#10291)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-21 10:46:20 +00:00
Zhong Qishuai
f0e0238016
[Doc] fix a small typo in docstring of llama_tool_parser (#10513)
2024-11-21 09:05:23 +00:00
youkaichao
aaddce5d26
[platforms] improve error message for unspecified platforms (#10520)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 23:07:56 -08:00
Cyrus Leung
3430857b64
[Misc] Increase default video fetch timeout (#10495)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 23:06:42 -08:00
Luka Govedič
8b0fe06c89
[torch.compile] Inductor code caching fix (#10273)
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
2024-11-20 21:44:57 -08:00
Mengqing Cao
9d827170a3
[Platforms] Add device_type in Platform (#10508)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-21 04:44:20 +00:00
Pavani Majety
6c1208d083
[Core] Add Sliding Window Support with Flashinfer (#10462)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2024-11-20 19:56:47 -08:00
youkaichao
388ee3de66
[torch.compile] limit inductor threads and lazy import quant (#10482)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 18:36:33 -08:00
Woosuk Kwon
2f77b6cfec
[TPU] Implement prefix caching for TPUs (#10307)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-20 13:54:15 -08:00
Guillaume Calmettes
c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server (#10463)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-20 16:42:21 -05:00
youkaichao
0cd3d9717e
[7/N] torch.compile, reduce compilation time (#10460)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 11:20:38 -08:00
Simon Mo
5f1d6af2b6
[perf bench] H200 development (#9768)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-20 11:06:56 -08:00
youkaichao
772a66732d
[platforms] restore xpu check for parallel config (#10479)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 17:13:28 +00:00
Li, Jiang
63f1fde277
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-20 10:57:39 +00:00
Mengqing Cao
d5b28447e0
[Platforms] Refactor xpu code (#10468)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-19 22:52:13 -08:00
Cyrus Leung
09dbf9ff16
[Bugfix] Handle conflicts between modern and legacy fields (#10471)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 14:45:08 +08:00
Sky Lee
343041c4c4
[model] Reduce medusa weight (#10454)
Signed-off-by: skylee-01 <497627264@qq.com>
2024-11-20 06:05:55 +00:00
Kevin H. Luu
ed701ca963
[ci/build] Combine nightly and optional (#10465)
2024-11-19 21:36:03 -08:00
wchen61
7629a9c6e5
[CI/Build] Support compilation with local cutlass path (#10423) (#10424)
2024-11-19 21:35:50 -08:00
Rafael Vasquez
709c9f1f25
[CI/Build] Add sphinx/rst linter for docs (#10366)
2024-11-19 21:35:31 -08:00
Cyrus Leung
b4be5a8adb
[Bugfix] Enforce no chunked prefill for embedding models (#10470)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 05:12:51 +00:00
Isotr0py
ad44437ba3
[Bugfix] Fix Mamba model initialization and MLP Speculator weights loading (#10456)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-20 05:04:05 +00:00
Yanyi Liu
9e05252b46
[Misc] Add __setitem__ for LazyDict (#10469)
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
2024-11-20 04:44:57 +00:00
Lucas Wilkinson
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
Alexei-V-Ivanov-AMD
d5b68aba2f
[CI/Build] Update Dockerfile.rocm (#10434)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2024-11-19 17:19:59 -08:00