Author | Commit | Message | Date
Antoni Baum | ccdc490dda | [Core] Change LoRA embedding sharding to support loading methods (#5038) | 2024-06-06 19:07:57 -07:00
Matthew Goldey | 828da0d44e | [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) | 2024-06-06 15:48:13 -05:00
liuyhwangyh | 4efff036f0 | Bugfix: fix broken download of models from modelscope (#5233) (co-authored by mulin.lyh) | 2024-06-06 09:28:10 -07:00
Cyrus Leung | 89c920785f | [CI/Build] Update vision tests (#5307) | 2024-06-06 05:17:18 -05:00
Breno Faria | 7b0a0dfb22 | [Frontend][Core] Update Outlines Integration from FSM to Guide (#4109) (co-authored by Simon Mo, Breno Faria) | 2024-06-05 16:49:12 -07:00
Nick Hill | faf71bcd4b | [Speculative Decoding] Add ProposerWorkerBase abstract class (#5252) | 2024-06-05 14:53:05 -07:00
Woosuk Kwon | 41ca62cf03 | [Misc] Add CustomOp interface for device portability (#5255) | 2024-06-05 09:18:19 -07:00
zifeitong | 974fc9b845 | [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) | 2024-06-04 19:37:28 -07:00
Cyrus Leung | 9ba093b4f4 | [CI/Build] Simplify model loading for HfRunner (#5251) | 2024-06-04 10:09:19 -07:00
Cyrus Leung | ec784b2526 | [CI/Build] Add inputs tests (#5215) | 2024-06-03 21:01:46 -07:00
afeldman-nm | f42a006b15 | [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210) | 2024-06-03 20:32:57 -07:00
Toshiki Kataoka | 06b2550cbb | [Bugfix] Support prompt_logprobs==0 (#5217) | 2024-06-03 17:59:30 -07:00
Breno Faria | f775a07e30 | [FRONTEND] OpenAI tools support named functions (#5032) | 2024-06-03 18:25:29 -05:00
Kaiyang Chen | 10c38e3e46 | [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) | 2024-06-03 13:37:11 -07:00
Yuan | cafb8e06c5 | [CI/BUILD] enable intel queue for longer CPU tests (#4113) | 2024-06-03 10:39:50 -07:00
Tyler Michael Smith | cbb2f59cc8 | [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) | 2024-06-03 09:52:30 -07:00
Cyrus Leung | 7a64d24aad | [Core] Support image processor (#4197) | 2024-06-02 22:56:41 -07:00
Cyrus Leung | dfbe60dc62 | [Misc] Simplify code and fix type annotations in conftest.py (#5118) | 2024-06-02 16:05:50 -07:00
Simon Mo | ed59a7ed23 | Update test_ignore_eos (#4898) | 2024-06-02 02:21:53 +00:00
chenqianfzh | b9c0605a8e | [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) | 2024-06-01 14:51:10 -06:00
Varun Sundar Rabindranath | f081c3ce4b | [Kernel] Update Cutlass fp8 configs (#5144) (co-authored by Varun Sundar Rabindranath, Robert Shaw) | 2024-06-01 08:46:07 +00:00
Tyler Michael Smith | 260d119e86 | [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) | 2024-06-01 06:45:32 +00:00
SnowDist | a22dea54d3 | [Model] Support MAP-NEO model (#5081) (co-authored by Zhuohan Li) | 2024-05-30 19:24:41 -07:00
Breno Faria | 87d41c849d | [BUGFIX] [FRONTEND] Correct chat logprobs (#5029) (co-authored by Breno Faria) | 2024-05-30 02:52:14 -07:00
Cyrus Leung | b1c255630d | [Core] Avoid the need to pass None values to Sequence.inputs (#5099) | 2024-05-29 16:05:01 -07:00
Cyrus Leung | eecd864388 | [Bugfix][CI/Build] Fix test and improve code for merge_async_iterators (#5096) | 2024-05-29 16:02:25 -07:00
afeldman-nm | 4238bc82f2 | [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837) | 2024-05-29 16:09:13 +00:00
Cyrus Leung | 18c1f16d86 | [Bugfix] Fix arguments passed to Sequence in stop checker test (#5092) | 2024-05-29 07:16:41 +00:00
youkaichao | 5bd3c65072 | [Core][Optimization] remove vllm-nccl (#5091) | 2024-05-29 05:13:52 +00:00
Junichi Sato | dfba529b40 | [Bugfix] Remove the last EOS token unless explicitly specified (#5077) | 2024-05-28 17:15:35 -07:00
Cyrus Leung | 5ae5ed1e60 | [Core] Consolidate prompt arguments to LLM engines (#4328) (co-authored by Roger Wang) | 2024-05-28 13:29:31 -07:00
Michał Moskal | d4f3985907 | [Core] Sliding window for block manager v2 (#4545) (co-authored by Ruth Evans) | 2024-05-28 11:07:07 +09:00
Zhuohan Li | 1102bef219 | [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) (co-authored by rsnm2, Robert Shaw) | 2024-05-27 15:18:17 -07:00
Lily Liu | d5a1697772 | [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) | 2024-05-25 10:00:14 -07:00
Eric Xihui Lin | 8e192ff967 | [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) (co-authored by beagleski, bapatra, Barun Patra, Michael Goin) | 2024-05-24 22:00:52 -07:00
leiwen83 | e64fde4b01 | [Core][Bugfix]: fix prefix caching for blockv2 (#4764) (co-authored by Lei Wen) | 2024-05-24 10:07:09 -07:00
Robert Shaw | 919770957f | [Bugfix] Fix Mistral v0.3 Weight Loading (#5005) (co-authored by Cody Yu) | 2024-05-24 12:28:27 +00:00
Dipika Sikka | a1242324c9 | [Kernel] Initial Activation Quantization Support (#4525) (co-authored by Varun Sundar Rabindranath) | 2024-05-23 21:29:18 +00:00
Murali Andoorveedu | 5eda2ea02a | [Core][1/N] Support send/recv in PyNCCL Groups (#4988) | 2024-05-23 09:54:48 -07:00
Alexander Matveev | 6066253296 | Marlin 24 prefill performance improvement (about 25% better on average) (#4983) | 2024-05-23 02:39:27 -04:00
Cody Yu | ee3eea0a1b | [Misc] Take user preference in attention selector (#4960) | 2024-05-23 07:55:56 +09:00
raywanb | 97b030005c | [Model] LoRA gptbigcode implementation (#3949) | 2024-05-22 13:58:59 -07:00
Cody Yu | a3a73ab069 | [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893). Second PR for #4532; supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (via the .kv_scale parameter). | 2024-05-22 13:28:20 -07:00
Tyler Michael Smith | 8674f9880e | [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954). Passes the CUDA stream into the CUTLASS GEMMs to avoid future issues with CUDA graphs. | 2024-05-22 14:10:43 +00:00
SangBin Cho | c74c913bfb | [misc] remove comments that were supposed to be removed (#4977) | 2024-05-22 09:02:58 -04:00
sasha0552 | 9b9a10d6cb | [Frontend] Dynamic RoPE scaling (#4638) | 2024-05-22 01:32:35 -04:00
Isotr0py | f12c3b5b3d | [Model] Add Phi-2 LoRA support (#4886) | 2024-05-21 14:24:17 +09:00
Alexei-V-Ivanov-AMD | 943e72ca56 | [Build/CI] Enabling AMD Entrypoints Test (#4834) (co-authored by Alexey Kondratiev) | 2024-05-20 11:29:28 -07:00
Woosuk Kwon | b57e6c5949 | [Kernel] Add flash-attn back (#4907) | 2024-05-19 18:11:30 -07:00
Alexander Matveev | 27ce85476e | [Kernel] Add marlin_24 unit tests (#4901) | 2024-05-19 11:37:34 -04:00