SangBin Cho
0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption ( #4451 )
2024-05-01 19:24:13 -07:00
Danny Guinther
b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided ( #4273 )
2024-05-01 17:34:40 -07:00
Woosuk Kwon
826b82a260
[Misc] Fix expert_ids shape in MoE ( #4517 )
2024-05-01 23:47:59 +00:00
Philipp Moritz
c9d852d601
[Misc] Remove Mixtral device="cuda" declarations ( #4543 )
...
Remove the device="cuda" declarations in mixtral as promised in #4343
2024-05-01 16:30:52 -07:00
youkaichao
6ef09b08f8
[Core][Distributed] fix pynccl del error ( #4508 )
2024-05-01 15:23:06 -07:00
Roy
3a922c1e7e
[Bugfix][Core] Fix and refactor logging stats ( #4336 )
2024-05-01 20:08:14 +00:00
sasha0552
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
Philipp Moritz
24bb4fe432
[Kernel] Update fused_moe tuning script for FP8 ( #4457 )
...
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that for small batch sizes I removed num_warps and num_stages from the configuration, since doing so improved performance and brought the benchmarks in that regime on par with the previous numbers, making this a strict improvement over the status quo.
All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, with 1000 input and 50 output tokens.
Before this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency
After this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
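The description above mentions tuned kernel configurations keyed by batch size, where some fields (num_warps, num_stages) are omitted for small batches so the kernel falls back to its defaults. A minimal sketch of that lookup pattern, assuming a simple dict-of-dicts layout and a nearest-key selection rule (the field names and keys here are illustrative, not the exact vLLM config schema):

```python
# Hypothetical tuned configs keyed by batch size (M). For small M,
# num_warps and num_stages are omitted so the kernel uses its defaults,
# mirroring the change described in the PR text above.
TUNED_CONFIGS = {
    1:  {"BLOCK_SIZE_M": 16},  # small batch: defaults for warps/stages
    16: {"BLOCK_SIZE_M": 32, "num_warps": 4, "num_stages": 2},
    64: {"BLOCK_SIZE_M": 64, "num_warps": 8, "num_stages": 4},
}

def select_config(m: int) -> dict:
    """Pick the tuned config whose batch-size key is closest to m."""
    best_key = min(TUNED_CONFIGS, key=lambda k: abs(k - m))
    return TUNED_CONFIGS[best_key]
```

With this scheme, a batch size between tuned points simply reuses the nearest tuned entry rather than requiring an exhaustive table.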
2024-05-01 11:47:38 -07:00
Nick Hill
a657bfc48a
[Core] Add multiproc_worker_utils for multiprocessing-based workers ( #4357 )
2024-05-01 18:41:59 +00:00
leiwen83
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding ( #4237 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
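Ngram prompt lookup decoding speculates draft tokens by matching the trailing n-gram of the generated sequence against an earlier occurrence in the prompt and proposing the tokens that followed it. A minimal sketch of that matching step (the function name, parameters, and search order are illustrative assumptions, not the PR's implementation):

```python
def ngram_lookup_draft(tokens: list[int], n: int = 3, k: int = 5) -> list[int]:
    """Propose up to k draft tokens by matching the trailing n-gram of
    `tokens` against an earlier occurrence in the same sequence."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan earlier positions for the same n-gram, most recent match first.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            # Return the tokens that followed the match as the draft.
            return tokens[i + n:i + n + k]
    return []
```

The drafted tokens are then verified by the target model in a single forward pass, as in other speculative decoding schemes.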
2024-05-01 11:13:03 -07:00
Travis Johnson
8b798eec75
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation ( #4534 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-05-01 18:01:50 +00:00
sasha0552
69909126a7
[Bugfix] Use random seed if seed is -1 ( #4531 )
2024-05-01 10:41:17 -07:00
Frαnçois
e491c7e053
[Doc] update(example model): for OpenAI compatible serving ( #4503 )
2024-05-01 10:14:16 -07:00
Robert Shaw
4dc8026d86
[Bugfix] Fix 307 Redirect for /metrics ( #4523 )
2024-05-01 09:14:13 -07:00
AnyISalIn
a88bb9b032
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. ( #4173 )
...
Signed-off-by: AnyISalIn <anyisalin@gmail.com>
2024-05-01 09:11:03 -07:00
SangBin Cho
6f1df80436
[Test] Add ignore_eos test ( #4519 )
2024-05-01 08:45:42 -04:00
Jee Li
d6f4bd7cdd
[Misc]Add customized information for models ( #4132 )
2024-04-30 21:18:14 -07:00
Robert Caulk
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
Pastel!
a822eb3413
[Misc] fix typo in block manager ( #4453 )
2024-04-30 20:41:32 -07:00
harrywu
f458112e8a
[Misc][Typo] type annotation fix ( #4495 )
2024-04-30 20:21:39 -07:00
Nick Hill
2e240c69a9
[Core] Centralize GPU Worker construction ( #4419 )
2024-05-01 01:06:34 +00:00
fuchen.ljl
ee37328da0
Fix "Unable to find Punica extension" issue during source code installation ( #4494 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-05-01 00:42:09 +00:00
fuchen.ljl
6ad58f42c5
fix_tokenizer_snapshot_download_bug ( #4493 )
2024-04-30 16:38:50 -07:00
Li, Jiang
dd1a50a8bc
[Bugfix][Minor] Make ignore_eos effective ( #4468 )
2024-04-30 16:33:33 -07:00
Alpay Ariyak
715c2d854d
[Frontend] [Core] Tensorizer: support dynamic num_readers, update version ( #4467 )
2024-04-30 16:32:13 -07:00
Florian Greinacher
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Robert Shaw
111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) ( #4332 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-04-30 21:46:12 +00:00
Prashant Gupta
b31a1fb63c
[Doc] add visualization for multi-stage dockerfile ( #4456 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-30 17:41:59 +00:00
leiwen83
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor ( #4165 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
Kunshang Ji
26f2fb5113
[Core]Refactor gptq_marlin ops ( #4466 )
2024-04-30 08:14:47 -04:00
Woosuk Kwon
fa32207842
[Bugfix][Kernel] Fix compute_type for MoE kernel ( #4463 )
2024-04-29 22:05:40 -07:00
Michael Goin
d627a3d837
[Misc] Upgrade to torch==2.3.0 ( #4454 )
2024-04-29 20:05:47 -04:00
youkaichao
f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu ( #4444 )
2024-04-29 13:52:22 -07:00
Simon Mo
ac5ccf0156
[CI] hotfix: soft fail neuron test ( #4458 )
2024-04-29 19:50:01 +00:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
SangBin Cho
df29793dc7
[mypy][5/N] Support all typing on model executor ( #4427 )
2024-04-28 19:01:26 -07:00
Simon Mo
03dd7d52bf
[CI] clean docker cache for neuron ( #4441 )
2024-04-28 23:32:07 +00:00
Ronen Schaffer
bf480c5302
Add more Prometheus metrics ( #2764 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-28 15:59:33 -07:00
DefTruth
9c7306ac11
[Misc] fix typo in llm_engine init logging ( #4428 )
2024-04-28 18:58:30 +08:00
Robert Shaw
4ea1f9678d
[BugFix] Resolved Issues For LinearMethod --> QuantConfig ( #4418 )
2024-04-27 18:35:33 +00:00
Nick Hill
ba4be44c32
[BugFix] Fix return type of executor execute_model methods ( #4402 )
2024-04-27 11:17:45 -07:00
Prashant Gupta
d6e520e170
[Core] Support offline use of local cache for models ( #4374 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
2024-04-27 09:59:55 -07:00
Nick Hill
81661da7b2
[BugFix] Fix min_tokens when eos_token_id is None ( #4389 )
...
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
2024-04-27 09:52:46 -07:00
Ruoyu Qin
dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted ( #4363 )
2024-04-27 09:48:37 -07:00
Roy
7134303cbb
[Bugfix][Core] Fix get decoding config from ray ( #4335 )
2024-04-27 11:30:08 +00:00
Caio Mendes
3da24c2df7
[Model] Phi-3 4k sliding window temp. fix ( #4380 )
2024-04-27 18:08:15 +08:00
Austin Veselka
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Hongxia Yang
18d23f642a
[ROCm][Hardware][AMD] Enable group query attention for triton FA ( #4406 )
2024-04-26 23:37:40 -07:00
Roy
87f545ba6f
[Misc] Fix logger format typo ( #4396 )
2024-04-27 13:45:02 +08:00