0c785d344d  2024-08-29 16:48:11 -07:00  Wei-Sheng Chin
    Add more percentiles and latencies (#7759)

4664ceaad6  2024-08-29 19:09:08 -04:00  chenqianfzh
    support bitsandbytes 8-bit and FP4 quantized models (#7445)

257afc37c5  2024-08-29 13:58:14 -07:00  Harsha vardhan manoj Bikki
    [Neuron] Adding support for context-lenght, token-gen buckets. (#7885)
    Co-authored-by: Harsha Bikki <harbikh@amazon.com>

86a677de42  2024-08-29 16:46:55 -04:00  Dipika Sikka
    [misc] update tpu int8 to use new vLLM Parameters (#7973)

d78789ac16  2024-08-29 15:54:49 -04:00  Isotr0py
    [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism (#7954)

c334b1898b  2024-08-29 12:15:04 -07:00  kushanam
    extend cuda graph size for H200 (#7894)
    Co-authored-by: youkaichao <youkaichao@126.com>

6b3421567d  2024-08-29 14:53:11 -04:00  Pavani Majety
    [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

3f60f2244e  2024-08-29 11:18:26 -07:00  Alexander Matveev
    [Core] Combine async postprocessor and multi-step (#7921)

f205c09854  2024-08-28 22:18:13 -07:00  Jonas M. Kübler
    [Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899)

ef99a78760  2024-08-28 21:27:06 -07:00  youkaichao
    Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982)

74d5543ec5  2024-08-29 03:24:31 +00:00  Peter Salas
    [VLM][Core] Fix exceptions on ragged NestedTensors (#7974)

a7f65c2be9  2024-08-28 17:32:26 -07:00  youkaichao
    [torch.compile] remove reset (#7975)

4289cad37f  2024-08-28 17:22:43 -07:00  Nick Hill
    [Frontend] Minor optimizations to zmq decoupled front-end (#7957)
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>

af59df0a10  2024-08-28 19:19:17 -04:00  Michael Goin
    Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961)

ce6bf3a2cf  2024-08-28 16:10:12 -07:00  youkaichao
    [torch.compile] avoid Dynamo guard evaluation overhead (#7898)
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

3cdfe1f38b  2024-08-28 16:11:49 -06:00  bnellnm
    [Bugfix] Make torch registration of punica ops optional (#7970)

fdd9daafa3  2024-08-28 15:06:52 -07:00  Mor Zusman
    [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651)

8c56e57def  2024-08-28 13:54:23 -07:00  Stas Bekman
    [Doc] fix 404 link (#7966)

eeffde1ac0  2024-08-28 13:10:21 -07:00  Woosuk Kwon
    [TPU] Upgrade PyTorch XLA nightly (#7967)

e5697d161c  2024-08-28 15:37:47 -04:00  rasmith
    [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)

b98cc28f91  2024-08-28 10:01:22 -07:00  Pavani Majety
    [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798)
    Co-authored-by: Simon Mo <simon.mo@hey.com>

ef9baee3c5  2024-08-28 08:11:18 -07:00  Cyrus Leung
    [Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948)

98c12cffe5  2024-08-28 12:12:32 +00:00  Stas Bekman
    [Doc] fix the autoAWQ example (#7937)

f52a43a8b9  2024-08-28 01:27:07 -07:00  youkaichao
    [ci][test] fix pp test failure (#7945)

e3580537a4  2024-08-28 00:36:31 -07:00  Cody Yu
    [Performance] Enable chunked prefill and prefix caching together (#7753)

f508e03e7f  2024-08-28 00:02:30 -07:00  Alexander Matveev
    [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911)

51f86bf487  2024-08-27 23:47:44 -07:00  Cyrus Leung
    [mypy][CI/Build] Fix mypy errors (#7929)

c166e7e43e  2024-08-27 23:13:45 -04:00  bnellnm
    [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886)

bc6e42a9b1  2024-08-27 19:50:06 -07:00  youkaichao
    [hardware][rocm] allow rocm to override default env var (#7926)

fab5f53e2d  2024-08-28 01:53:56 +00:00  Peter Salas
    [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902)

9c71c97ae2  2024-08-28 07:11:14 +08:00  Jonathan Berkhahn
    [mypy] Enable mypy type checking for vllm/core (#7229)

5340a2dccf  2024-08-28 07:09:02 +08:00  zifeitong
    [Model] Add multi-image input support for LLaVA-Next offline inference (#7230)

345be0e244  2024-08-27 15:07:53 -07:00  Philipp Schmid
    [benchmark] Update TGI version (#7917)

fc911880cc  2024-08-27 15:07:09 -07:00  Dipika Sikka
    [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>

ed6f002d33  2024-08-27 12:06:11 -07:00  youkaichao
    [cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924)

b09c755be8  2024-08-27 17:36:09 +00:00  Isotr0py
    [Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916)

42e932c7d4  2024-08-27 10:09:13 -07:00  alexeykondrat
    [CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237)

076169f603  2024-08-27 10:07:02 -07:00  Kunshang Ji
    [Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810)

9db642138b  2024-08-27 15:28:30 +00:00  Isotr0py
    [CI/Build][VLM] Cleanup multiple images inputs model test (#7897)

6fc4e6e07a  2024-08-27 12:40:02 +00:00  Patrick von Platen
    [Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739)

9606c7197d  2024-08-27 00:16:31 -07:00  Cody Yu
    Revert #7509 (#7887)

64cc644425  2024-08-26 21:33:58 -07:00  youkaichao
    [core][torch.compile] discard the compile for profiling (#7796)

39178c7fbc  2024-08-26 21:33:17 -07:00  Nick Hill
    [Tests] Disable retries and use context manager for openai client (#7565)

2eedede875  2024-08-26 20:53:20 -07:00  Megha Agarwal
    [Core] Asynchronous Output Processor (#7049)
    Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>

015e6cc252  2024-08-26 18:09:34 -06:00  Dipika Sikka
    [Misc] Update compressed tensors lifecycle to remove prefix from create_weights (#7825)

760e9f71a8  2024-08-26 15:13:13 -07:00  omrishiv
    [Bugfix] neuron: enable tensor parallelism (#7562)
    Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

05826c887b  2024-08-26 15:02:25 -07:00  youkaichao
    [misc] fix custom allreduce p2p cache file generation (#7853)

dd9857f5fa  2024-08-26 17:44:54 -04:00  Dipika Sikka
    [Misc] Update gptq_marlin_24 to use vLLMParameters (#7762)
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

665304092d  2024-08-26 13:16:15 -06:00  Dipika Sikka
    [Misc] Update qqq to use vLLMParameters (#7805)

2deb029d11  2024-08-26 11:24:53 -07:00  Cody Yu
    [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822)