squall/vllm - vllm

Author	SHA1	Message	Date
Robert Shaw	58ca663224	[ Misc ] Improve Min Capability Checking in `compressed-tensors` (#6522)	2024-07-18 14:39:12 +00:00
Woosuk Kwon	4634c8728b	[TPU] Refactor TPU worker & model runner (#6506)	2024-07-18 01:34:16 -07:00
Noam Gat	c8a7d51c49	[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501)	2024-07-18 07:47:13 +00:00
Nick Hill	e2fbaee725	[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-07-18 15:13:30 +08:00
Cody Yu	8a74c68bd1	[Misc] Minor patch for draft model runner (#6523)	2024-07-18 06:06:21 +00:00
Rui Qiao	61e592747c	[Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032) Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>	2024-07-17 22:27:09 -07:00
Nick Hill	d25877dd9b	[BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530)	2024-07-17 22:24:43 -07:00
youkaichao	1c27d25fb5	[core][model] yet another cpu offload implementation (#6496) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-07-17 20:54:35 -07:00
Robert Shaw	18fecc3559	[ Kernel ] Fp8 Channelwise Weight Support (#6487)	2024-07-18 03:18:13 +00:00
Cody Yu	b5af8c223c	[Model] Pipeline parallel support for Mixtral (#6516)	2024-07-17 19:26:04 -07:00
Varun Sundar Rabindranath	b5241e41d9	[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-18 01:38:35 +00:00
Alexander Matveev	e76466dde2	[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338)	2024-07-17 14:30:28 -07:00
Antoni Baum	5f0b9933e6	[Bugfix] Fix Ray Metrics API usage (#6354)	2024-07-17 19:40:10 +00:00
milo157	a38524f338	[DOC] - Add docker image to Cerebrium Integration (#6510)	2024-07-17 10:22:53 -07:00
Cody Yu	2fa4623d9e	[Core] Refactor _prepare_model_input_tensors - take 2 (#6164)	2024-07-17 09:37:16 -07:00
Woosuk Kwon	a9a2e74d21	[Misc] Use `torch.Tensor` for type annotation (#6505)	2024-07-17 13:01:10 +00:00
Woosuk Kwon	e09ce759aa	[TPU] Remove multi-modal args in TPU backend (#6504)	2024-07-17 04:02:53 -07:00
Murali Andoorveedu	5fa6e9876e	[Bugfix] Fix for multinode crash on 4 PP (#6495) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-17 08:25:10 +00:00
Cyrus Leung	5bf35a91e4	[Doc][CI/Build] Update docs and tests to use `vllm serve` (#6431)	2024-07-17 07:43:21 +00:00
shangmingc	a19e8d3726	[Misc][Speculative decoding] Typos and typing fixes (#6467) Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com>	2024-07-17 07:17:07 +00:00
Hongxia Yang	10383887e0	[ROCm] Cleanup Dockerfile and remove outdated patch (#6482)	2024-07-16 22:47:02 -07:00
Wushi Dong	1d094fd7c0	[Distributed][PP] only create embedding & lm head when necessary (#6455) original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization	2024-07-16 19:20:26 -07:00
youkaichao	ce37be7ba0	[misc][distributed] add seed to dummy weights (#6491)	2024-07-16 19:16:34 -07:00
youkaichao	7f62077af5	[misc][distributed] improve tests (#6488)	2024-07-16 17:35:52 -07:00
youkaichao	09c2eb85dd	[ci][distributed] add pipeline parallel correctness test (#6410)	2024-07-16 15:44:22 -07:00
Michael Goin	978aed5300	[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081)	2024-07-16 15:31:32 -07:00
Cody Yu	160e1d8c99	[Misc] Log spec decode metrics (#6454)	2024-07-16 20:37:10 +00:00
Jiaxin Shan	94162beb9f	[Doc] Fix the lora adapter path in server startup script (#6230)	2024-07-16 10:11:04 -07:00
Woosuk Kwon	c467dff24f	[Hardware][TPU] Support MoE with Pallas GMM kernel (#6457)	2024-07-16 09:56:28 -07:00
youkaichao	9f4ccec761	[doc][misc] remind to cancel debugging environment variables (#6481) [doc][misc] remind users to cancel debugging environment variables after debugging (#6481)	2024-07-16 09:45:30 -07:00
Cyrus Leung	38ef94888a	[CI/Build] Remove "boardwalk" image asset (#6460)	2024-07-16 08:59:36 -07:00
Peng Guanwen	2bb0489cb3	[Core] Use numpy to speed up padded token processing (#6442)	2024-07-16 08:13:25 -07:00
Thomas Parnell	7508a3dc34	[Misc] Fix typos in spec. decode metrics logging. (#6470) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-07-16 13:55:15 +00:00
sasha0552	7a3d2a5b95	[Frontend] Support for chat completions input in the tokenize endpoint (#5923)	2024-07-16 20:18:09 +08:00
Cyrus Leung	d97011512e	[CI/Build] vLLM cache directory for images (#6444)	2024-07-15 23:12:25 -07:00
Woosuk Kwon	37d776606f	[Docs] Announce 5th meetup (#6458)	2024-07-15 21:04:58 -07:00
Joe	d92b3c5cde	[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419)	2024-07-15 18:54:15 -07:00
Mor Zusman	9ad32dacd9	[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) Co-authored-by: Mor Zusman <morz@ai21.com>	2024-07-16 01:32:55 +00:00
Kevin H. Luu	d6f3b3d5c4	Pin sphinx-argparse version (#6453) Signed-off-by: kevin <kevin@anyscale.com>	2024-07-16 01:26:11 +00:00
Woosuk Kwon	4552e37b55	[CI/Build][TPU] Add TPU CI test (#6277) Co-authored-by: kevin <kevin@anyscale.com>	2024-07-15 14:31:16 -07:00
Woosuk Kwon	ec9933f4a5	[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289)	2024-07-15 19:02:14 +00:00
Woosuk Kwon	3dee97b05f	[Docs] Add Google Cloud to sponsor list (#6450)	2024-07-15 11:58:10 -07:00
youkaichao	4cf256ae7f	[misc][distributed] fix pp missing layer condition (#6446)	2024-07-15 10:32:35 -07:00
Simon Mo	64fdc08c72	bump version to v0.5.2 (#6433)	2024-07-15 17:27:40 +00:00
Thomas Parnell	4ef95b0f06	[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-07-15 13:14:49 -04:00
Thomas Parnell	eaec4b9153	[Bugfix] Add custom Triton cache manager to resolve MoE MP issue (#6140) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com>	2024-07-15 10:12:47 -07:00
Pernekhan Utemuratov	a63a4c6341	[Misc] Use 0.0.9 version for flashinfer (#6447) Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>	2024-07-15 10:10:26 -07:00
Tyler Michael Smith	c8fd97f26d	[Kernel] Use CUTLASS kernels for the FP8 layers with Bias (#6270)	2024-07-15 13:05:52 -04:00
youkaichao	94b82e8c18	[doc][distributed] add suggestion for distributed inference (#6418)	2024-07-15 09:45:51 -07:00
Roger Wang	6ae1597ddf	[VLM] Minor space optimization for `ClipVisionModel` (#6436)	2024-07-15 17:29:51 +08:00

2172 Commits