Commit Graph

2172 Commits

Author SHA1 Message Date
Robert Shaw
58ca663224
[ Misc ] Improve Min Capability Checking in compressed-tensors (#6522) 2024-07-18 14:39:12 +00:00
Woosuk Kwon
4634c8728b
[TPU] Refactor TPU worker & model runner (#6506) 2024-07-18 01:34:16 -07:00
Noam Gat
c8a7d51c49
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) 2024-07-18 07:47:13 +00:00
Nick Hill
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-18 15:13:30 +08:00
Cody Yu
8a74c68bd1
[Misc] Minor patch for draft model runner (#6523) 2024-07-18 06:06:21 +00:00
Rui Qiao
61e592747c
[Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2024-07-17 22:27:09 -07:00
Nick Hill
d25877dd9b
[BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530) 2024-07-17 22:24:43 -07:00
youkaichao
1c27d25fb5
[core][model] yet another cpu offload implementation (#6496)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-17 20:54:35 -07:00
Robert Shaw
18fecc3559
[ Kernel ] Fp8 Channelwise Weight Support (#6487) 2024-07-18 03:18:13 +00:00
Cody Yu
b5af8c223c
[Model] Pipeline parallel support for Mixtral (#6516) 2024-07-17 19:26:04 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) 2024-07-17 14:30:28 -07:00
Antoni Baum
5f0b9933e6
[Bugfix] Fix Ray Metrics API usage (#6354) 2024-07-17 19:40:10 +00:00
milo157
a38524f338
[DOC] - Add docker image to Cerebrium Integration (#6510) 2024-07-17 10:22:53 -07:00
Cody Yu
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 (#6164) 2024-07-17 09:37:16 -07:00
Woosuk Kwon
a9a2e74d21
[Misc] Use torch.Tensor for type annotation (#6505) 2024-07-17 13:01:10 +00:00
Woosuk Kwon
e09ce759aa
[TPU] Remove multi-modal args in TPU backend (#6504) 2024-07-17 04:02:53 -07:00
Murali Andoorveedu
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP (#6495)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-17 08:25:10 +00:00
Cyrus Leung
5bf35a91e4
[Doc][CI/Build] Update docs and tests to use vllm serve (#6431) 2024-07-17 07:43:21 +00:00
shangmingc
a19e8d3726
[Misc][Speculative decoding] Typos and typing fixes (#6467)
Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com>
2024-07-17 07:17:07 +00:00
Hongxia Yang
10383887e0
[ROCm] Cleanup Dockerfile and remove outdated patch (#6482) 2024-07-16 22:47:02 -07:00
Wushi Dong
1d094fd7c0
[Distributed][PP] only create embedding & lm head when necessary (#6455)
original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization
2024-07-16 19:20:26 -07:00
youkaichao
ce37be7ba0
[misc][distributed] add seed to dummy weights (#6491) 2024-07-16 19:16:34 -07:00
youkaichao
7f62077af5
[misc][distributed] improve tests (#6488) 2024-07-16 17:35:52 -07:00
youkaichao
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test (#6410) 2024-07-16 15:44:22 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) 2024-07-16 15:31:32 -07:00
Cody Yu
160e1d8c99
[Misc] Log spec decode metrics (#6454) 2024-07-16 20:37:10 +00:00
Jiaxin Shan
94162beb9f
[Doc] Fix the lora adapter path in server startup script (#6230) 2024-07-16 10:11:04 -07:00
Woosuk Kwon
c467dff24f
[Hardware][TPU] Support MoE with Pallas GMM kernel (#6457) 2024-07-16 09:56:28 -07:00
youkaichao
9f4ccec761
[doc][misc] remind to cancel debugging environment variables (#6481)
[doc][misc] remind users to cancel debugging environment variables after debugging (#6481)
2024-07-16 09:45:30 -07:00
Cyrus Leung
38ef94888a
[CI/Build] Remove "boardwalk" image asset (#6460) 2024-07-16 08:59:36 -07:00
Peng Guanwen
2bb0489cb3
[Core] Use numpy to speed up padded token processing (#6442) 2024-07-16 08:13:25 -07:00
Thomas Parnell
7508a3dc34
[Misc] Fix typos in spec. decode metrics logging. (#6470)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-16 13:55:15 +00:00
sasha0552
7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint (#5923) 2024-07-16 20:18:09 +08:00
Cyrus Leung
d97011512e
[CI/Build] vLLM cache directory for images (#6444) 2024-07-15 23:12:25 -07:00
Woosuk Kwon
37d776606f
[Docs] Announce 5th meetup (#6458) 2024-07-15 21:04:58 -07:00
Joe
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419) 2024-07-15 18:54:15 -07:00
Mor Zusman
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425)
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-07-16 01:32:55 +00:00
Kevin H. Luu
d6f3b3d5c4
Pin sphinx-argparse version (#6453)
Signed-off-by: kevin <kevin@anyscale.com>
2024-07-16 01:26:11 +00:00
Woosuk Kwon
4552e37b55
[CI/Build][TPU] Add TPU CI test (#6277)
Co-authored-by: kevin <kevin@anyscale.com>
2024-07-15 14:31:16 -07:00
Woosuk Kwon
ec9933f4a5
[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289) 2024-07-15 19:02:14 +00:00
Woosuk Kwon
3dee97b05f
[Docs] Add Google Cloud to sponsor list (#6450) 2024-07-15 11:58:10 -07:00
youkaichao
4cf256ae7f
[misc][distributed] fix pp missing layer condition (#6446) 2024-07-15 10:32:35 -07:00
Simon Mo
64fdc08c72
bump version to v0.5.2 (#6433) 2024-07-15 17:27:40 +00:00
Thomas Parnell
4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-15 13:14:49 -04:00
Thomas Parnell
eaec4b9153
[Bugfix] Add custom Triton cache manager to resolve MoE MP issue (#6140)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com>
2024-07-15 10:12:47 -07:00
Pernekhan Utemuratov
a63a4c6341
[Misc] Use 0.0.9 version for flashinfer (#6447)
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
2024-07-15 10:10:26 -07:00
Tyler Michael Smith
c8fd97f26d
[Kernel] Use CUTLASS kernels for the FP8 layers with Bias (#6270) 2024-07-15 13:05:52 -04:00
youkaichao
94b82e8c18
[doc][distributed] add suggestion for distributed inference (#6418) 2024-07-15 09:45:51 -07:00
Roger Wang
6ae1597ddf
[VLM] Minor space optimization for ClipVisionModel (#6436) 2024-07-15 17:29:51 +08:00