| Daniele | 51f8aa90ad | [Bugfix][Frontend] remove duplicate init logger (#6581) | 2024-07-19 10:16:27 -07:00 |
| Thomas Parnell | a5314e8698 | [Model] RowParallelLinear: pass bias to quant_method.apply (#6327)<br>Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> | 2024-07-19 07:15:22 -06:00 |
| Woo-Yeon Lee | a921e86392 | [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) | 2024-07-19 06:01:09 -07:00 |
| Cyrus Leung | 6366efc67b | [Bugfix][Frontend] Fix missing /metrics endpoint (#6463) | 2024-07-19 03:55:13 +00:00 |
| Robert Shaw | dbe5588554 | [ Misc ] non-uniform quantization via compressed-tensors for Llama (#6515) | 2024-07-18 22:39:18 -04:00 |
| Thomas Parnell | d4201e06d5 | [Bugfix] Make spec. decode respect per-request seed. (#6034)<br>Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com><br>Co-authored-by: Nick Hill <nickhill@us.ibm.com> | 2024-07-18 19:22:08 -07:00 |
| Nick Hill | b5672a112c | [Core] Multiprocessing Pipeline Parallel support (#6130)<br>Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai> | 2024-07-18 19:15:52 -07:00 |
| Simon Mo | c5df56f88b | Add support for a rope extension method (#6553) | 2024-07-19 01:53:03 +00:00 |
| Tyler Michael Smith | 1689219ebf | [CI/Build] Build on Ubuntu 20.04 instead of 22.04 (#6517) | 2024-07-18 17:29:25 -07:00 |
| Tyler Michael Smith | 4ffffccb7e | [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552) | 2024-07-18 23:52:22 +00:00 |
| youkaichao | f53b8f0d05 | [ci][test] add correctness test for cpu offloading (#6549) | 2024-07-18 23:41:06 +00:00 |
| Kevin H. Luu | 2d4733ba2d | Fix PR comment bot (#6554)<br>Signed-off-by: kevin <kevin@anyscale.com> | 2024-07-18 14:48:29 -07:00 |
| Michael Goin | 15c6a079b1 | [Model] Support Mistral-Nemo (#6548) | 2024-07-18 20:31:50 +00:00 |
| Kevin H. Luu | ecdb462c24 | [ci] Reword Github bot comment (#6534) | 2024-07-18 08:01:45 -07:00 |
| Robert Shaw | 58ca663224 | [ Misc ] Improve Min Capability Checking in compressed-tensors (#6522) | 2024-07-18 14:39:12 +00:00 |
| Woosuk Kwon | 4634c8728b | [TPU] Refactor TPU worker & model runner (#6506) | 2024-07-18 01:34:16 -07:00 |
| Noam Gat | c8a7d51c49 | [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) | 2024-07-18 07:47:13 +00:00 |
| Nick Hill | e2fbaee725 | [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227)<br>Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> | 2024-07-18 15:13:30 +08:00 |
| Cody Yu | 8a74c68bd1 | [Misc] Minor patch for draft model runner (#6523) | 2024-07-18 06:06:21 +00:00 |
| Rui Qiao | 61e592747c | [Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032)<br>Signed-off-by: Rui Qiao <ruisearch42@gmail.com><br>Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> | 2024-07-17 22:27:09 -07:00 |
| Nick Hill | d25877dd9b | [BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530) | 2024-07-17 22:24:43 -07:00 |
| youkaichao | 1c27d25fb5 | [core][model] yet another cpu offload implementation (#6496)<br>Co-authored-by: Michael Goin <michael@neuralmagic.com> | 2024-07-17 20:54:35 -07:00 |
| Robert Shaw | 18fecc3559 | [ Kernel ] Fp8 Channelwise Weight Support (#6487) | 2024-07-18 03:18:13 +00:00 |
| Cody Yu | b5af8c223c | [Model] Pipeline parallel support for Mixtral (#6516) | 2024-07-17 19:26:04 -07:00 |
| Varun Sundar Rabindranath | b5241e41d9 | [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511)<br>Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-18 01:38:35 +00:00 |
| Alexander Matveev | e76466dde2 | [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) | 2024-07-17 14:30:28 -07:00 |
| Antoni Baum | 5f0b9933e6 | [Bugfix] Fix Ray Metrics API usage (#6354) | 2024-07-17 19:40:10 +00:00 |
| milo157 | a38524f338 | [DOC] - Add docker image to Cerebrium Integration (#6510) | 2024-07-17 10:22:53 -07:00 |
| Cody Yu | 2fa4623d9e | [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) | 2024-07-17 09:37:16 -07:00 |
| Woosuk Kwon | a9a2e74d21 | [Misc] Use torch.Tensor for type annotation (#6505) | 2024-07-17 13:01:10 +00:00 |
| Woosuk Kwon | e09ce759aa | [TPU] Remove multi-modal args in TPU backend (#6504) | 2024-07-17 04:02:53 -07:00 |
| Murali Andoorveedu | 5fa6e9876e | [Bugfix] Fix for multinode crash on 4 PP (#6495)<br>Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai> | 2024-07-17 08:25:10 +00:00 |
| Cyrus Leung | 5bf35a91e4 | [Doc][CI/Build] Update docs and tests to use vllm serve (#6431) | 2024-07-17 07:43:21 +00:00 |
| shangmingc | a19e8d3726 | [Misc][Speculative decoding] Typos and typing fixes (#6467)<br>Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com> | 2024-07-17 07:17:07 +00:00 |
| Hongxia Yang | 10383887e0 | [ROCm] Cleanup Dockerfile and remove outdated patch (#6482) | 2024-07-16 22:47:02 -07:00 |
| Wushi Dong | 1d094fd7c0 | [Distributed][PP] only create embedding & lm head when necessary (#6455)<br>original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization | 2024-07-16 19:20:26 -07:00 |
| youkaichao | ce37be7ba0 | [misc][distributed] add seed to dummy weights (#6491) | 2024-07-16 19:16:34 -07:00 |
| youkaichao | 7f62077af5 | [misc][distributed] improve tests (#6488) | 2024-07-16 17:35:52 -07:00 |
| youkaichao | 09c2eb85dd | [ci][distributed] add pipeline parallel correctness test (#6410) | 2024-07-16 15:44:22 -07:00 |
| Michael Goin | 978aed5300 | [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) | 2024-07-16 15:31:32 -07:00 |
| Cody Yu | 160e1d8c99 | [Misc] Log spec decode metrics (#6454) | 2024-07-16 20:37:10 +00:00 |
| Jiaxin Shan | 94162beb9f | [Doc] Fix the lora adapter path in server startup script (#6230) | 2024-07-16 10:11:04 -07:00 |
| Woosuk Kwon | c467dff24f | [Hardware][TPU] Support MoE with Pallas GMM kernel (#6457) | 2024-07-16 09:56:28 -07:00 |
| youkaichao | 9f4ccec761 | [doc][misc] remind to cancel debugging environment variables (#6481)<br>[doc][misc] remind users to cancel debugging environment variables after debugging (#6481) | 2024-07-16 09:45:30 -07:00 |
| Cyrus Leung | 38ef94888a | [CI/Build] Remove "boardwalk" image asset (#6460) | 2024-07-16 08:59:36 -07:00 |
| Peng Guanwen | 2bb0489cb3 | [Core] Use numpy to speed up padded token processing (#6442) | 2024-07-16 08:13:25 -07:00 |
| Thomas Parnell | 7508a3dc34 | [Misc] Fix typos in spec. decode metrics logging. (#6470)<br>Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> | 2024-07-16 13:55:15 +00:00 |
| sasha0552 | 7a3d2a5b95 | [Frontend] Support for chat completions input in the tokenize endpoint (#5923) | 2024-07-16 20:18:09 +08:00 |
| Cyrus Leung | d97011512e | [CI/Build] vLLM cache directory for images (#6444) | 2024-07-15 23:12:25 -07:00 |
| Woosuk Kwon | 37d776606f | [Docs] Announce 5th meetup (#6458) | 2024-07-15 21:04:58 -07:00 |