2504 Commits

Author SHA1 Message Date
Harsha vardhan manoj Bikki
257afc37c5
[Neuron] Adding support for context-length, token-gen buckets. (#7885)
Co-authored-by: Harsha Bikki <harbikh@amazon.com>
2024-08-29 13:58:14 -07:00
Dipika Sikka
86a677de42
[misc] update tpu int8 to use new vLLM Parameters (#7973) 2024-08-29 16:46:55 -04:00
Isotr0py
d78789ac16
[Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism (#7954) 2024-08-29 15:54:49 -04:00
kushanam
c334b1898b
extend cuda graph size for H200 (#7894)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-29 12:15:04 -07:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step (#7921) 2024-08-29 11:18:26 -07:00
Jonas M. Kübler
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) 2024-08-28 22:18:13 -07:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) 2024-08-28 21:27:06 -07:00
Peter Salas
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors (#7974) 2024-08-29 03:24:31 +00:00
youkaichao
a7f65c2be9
[torch.compile] remove reset (#7975) 2024-08-28 17:32:26 -07:00
Nick Hill
4289cad37f
[Frontend] Minor optimizations to zmq decoupled front-end (#7957)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-08-28 17:22:43 -07:00
Michael Goin
af59df0a10
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961) 2024-08-28 19:19:17 -04:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-28 16:10:12 -07:00
bnellnm
3cdfe1f38b
[Bugfix] Make torch registration of punica ops optional (#7970) 2024-08-28 16:11:49 -06:00
Mor Zusman
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) 2024-08-28 15:06:52 -07:00
Stas Bekman
8c56e57def
[Doc] fix 404 link (#7966) 2024-08-28 13:54:23 -07:00
Woosuk Kwon
eeffde1ac0
[TPU] Upgrade PyTorch XLA nightly (#7967) 2024-08-28 13:10:21 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) 2024-08-28 15:37:47 -04:00
Pavani Majety
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
Cyrus Leung
ef9baee3c5
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948) 2024-08-28 08:11:18 -07:00
Stas Bekman
98c12cffe5
[Doc] fix the autoAWQ example (#7937) 2024-08-28 12:12:32 +00:00
youkaichao
f52a43a8b9
[ci][test] fix pp test failure (#7945) 2024-08-28 01:27:07 -07:00
Cody Yu
e3580537a4
[Performance] Enable chunked prefill and prefix caching together (#7753) 2024-08-28 00:36:31 -07:00
Alexander Matveev
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911) 2024-08-28 00:02:30 -07:00
Cyrus Leung
51f86bf487
[mypy][CI/Build] Fix mypy errors (#7929) 2024-08-27 23:47:44 -07:00
bnellnm
c166e7e43e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886) 2024-08-27 23:13:45 -04:00
youkaichao
bc6e42a9b1
[hardware][rocm] allow rocm to override default env var (#7926) 2024-08-27 19:50:06 -07:00
Peter Salas
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) 2024-08-28 01:53:56 +00:00
Jonathan Berkhahn
9c71c97ae2
[mypy] Enable mypy type checking for vllm/core (#7229) 2024-08-28 07:11:14 +08:00
zifeitong
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference (#7230) 2024-08-28 07:09:02 +08:00
Philipp Schmid
345be0e244
[benchmark] Update TGI version (#7917) 2024-08-27 15:07:53 -07:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
youkaichao
ed6f002d33
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924) 2024-08-27 12:06:11 -07:00
Isotr0py
b09c755be8
[Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916) 2024-08-27 17:36:09 +00:00
alexeykondrat
42e932c7d4
[CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237) 2024-08-27 10:09:13 -07:00
Kunshang Ji
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810) 2024-08-27 10:07:02 -07:00
Isotr0py
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897) 2024-08-27 15:28:30 +00:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) 2024-08-27 12:40:02 +00:00
Cody Yu
9606c7197d
Revert #7509 (#7887) 2024-08-27 00:16:31 -07:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling (#7796) 2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Dipika Sikka
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights (#7825) 2024-08-26 18:09:34 -06:00
omrishiv
760e9f71a8
[Bugfix] neuron: enable tensor parallelism (#7562)
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
2024-08-26 15:13:13 -07:00
youkaichao
05826c887b
[misc] fix custom allreduce p2p cache file generation (#7853) 2024-08-26 15:02:25 -07:00
Dipika Sikka
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters (#7762)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-26 17:44:54 -04:00
Dipika Sikka
665304092d
[Misc] Update qqq to use vLLMParameters (#7805) 2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) 2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00