Robert Shaw
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token fp8 (#6547)
2024-07-19 23:08:15 +00:00
Thomas Parnell
f0bbfaf917
[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578)
2024-07-19 14:01:03 -07:00
Antoni Baum
9ed82e7074
[Misc] Small perf improvements (#6520)
2024-07-19 12:10:56 -07:00
Thomas Parnell
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply (#6327)
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-19 07:15:22 -06:00
Woo-Yeon Lee
a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369)
2024-07-19 06:01:09 -07:00
Cyrus Leung
6366efc67b
[Bugfix][Frontend] Fix missing /metrics endpoint (#6463)
2024-07-19 03:55:13 +00:00
Thomas Parnell
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. (#6034)
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-18 19:22:08 -07:00
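PR #6034 above makes speculative decoding honor a per-request seed. The underlying idea — each request carries its own RNG so its samples are reproducible regardless of how requests are batched — can be sketched in plain NumPy (a simplification for illustration; vLLM's sampler is considerably more involved):

```python
import numpy as np

def sample_with_seed(probs, seed):
    # Each request gets its own generator keyed by its seed, so the
    # token drawn does not depend on which other requests share the batch.
    rng = np.random.default_rng(seed)
    return int(rng.choice(len(probs), p=probs))

probs = [0.1, 0.2, 0.7]
a = sample_with_seed(probs, seed=42)
b = sample_with_seed(probs, seed=42)
assert a == b  # same seed, same token
```

Without per-request generators, a shared RNG advances by however many draws other requests in the batch consumed, so the same prompt and seed could yield different tokens across runs.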
Nick Hill
b5672a112c
[Core] Multiprocessing Pipeline Parallel support (#6130)
...
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-18 19:15:52 -07:00
youkaichao
f53b8f0d05
[ci][test] add correctness test for cpu offloading (#6549)
2024-07-18 23:41:06 +00:00
Nick Hill
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227)
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-18 15:13:30 +08:00
Cody Yu
b5af8c223c
[Model] Pipeline parallel support for Mixtral (#6516)
2024-07-17 19:26:04 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511)
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
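The dynamic-per-token FP8 path in #6511 computes one scale per token (row) at runtime rather than a single per-tensor scale, so outlier tokens don't crush the precision of everyone else. A minimal NumPy sketch of the idea — with fp8's e4m3 rounding simplified to an integer grid, so this is illustrative only, not the kernel's actual arithmetic:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def dynamic_per_token_quant(x):
    """One scale per token (row): map each row's max |value| onto the fp8 range."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.where(amax == 0, 1.0, amax / FP8_E4M3_MAX)
    # Real fp8 rounds onto the e4m3 grid; an integer grid keeps the sketch simple.
    q = np.clip(np.round(x / scale), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

x = np.array([[0.1, -2.0, 3.5], [100.0, -50.0, 25.0]])
q, scale = dynamic_per_token_quant(x)
x_hat = q * scale  # dequantize: per-row scale broadcasts back
```

Because the scale is recomputed from each token's own activations, no calibration pass is needed, at the cost of an extra reduction per row.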
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338)
2024-07-17 14:30:28 -07:00
Antoni Baum
5f0b9933e6
[Bugfix] Fix Ray Metrics API usage (#6354)
2024-07-17 19:40:10 +00:00
Cody Yu
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 (#6164)
2024-07-17 09:37:16 -07:00
Murali Andoorveedu
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP (#6495)
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-17 08:25:10 +00:00
Cyrus Leung
5bf35a91e4
[Doc][CI/Build] Update docs and tests to use vllm serve (#6431)
2024-07-17 07:43:21 +00:00
youkaichao
7f62077af5
[misc][distributed] improve tests (#6488)
2024-07-16 17:35:52 -07:00
youkaichao
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test (#6410)
2024-07-16 15:44:22 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081)
2024-07-16 15:31:32 -07:00
Cody Yu
160e1d8c99
[Misc] Log spec decode metrics (#6454)
2024-07-16 20:37:10 +00:00
Cyrus Leung
38ef94888a
[CI/Build] Remove "boardwalk" image asset (#6460)
2024-07-16 08:59:36 -07:00
sasha0552
7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint (#5923)
2024-07-16 20:18:09 +08:00
Cyrus Leung
d97011512e
[CI/Build] vLLM cache directory for images (#6444)
2024-07-15 23:12:25 -07:00
Joe
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419)
2024-07-15 18:54:15 -07:00
Mor Zusman
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425)
...
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-07-16 01:32:55 +00:00
Thomas Parnell
4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409)
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-15 13:14:49 -04:00
youkaichao
69672f116c
[core][distributed] simplify code to support pipeline parallel (#6406)
2024-07-14 21:20:51 -07:00
zifeitong
b47008b4d2
[BugFix] BatchResponseData body should be optional (#6345)
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-15 04:06:09 +00:00
Ethan Xu
dbfe254eda
[Feature] vLLM CLI (#5090)
...
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-07-14 15:36:43 -07:00
Isotr0py
540c0368b1
[Model] Initialize Fuyu-8B support (#3924)
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-14 05:27:14 +00:00
youkaichao
41708e5034
[ci] try to add multi-node tests (#6280)
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-12 21:51:48 -07:00
Michael Goin
111fc6e7ec
[Misc] Add generated git commit hash as vllm.__commit__ (#6386)
2024-07-12 22:52:15 +00:00
Yihuan Bu
b039cbbce3
[Misc] add fixture to guided processor tests (#6341)
2024-07-12 09:55:39 -07:00
Cyrus Leung
024ad87cdc
[Bugfix] Fix dtype mismatch in PaliGemma (#6367)
2024-07-12 08:22:18 -07:00
Robert Shaw
aea19f0989
[ Misc ] Support Models With Bias in compressed-tensors integration (#6356)
2024-07-12 11:11:29 -04:00
Hongxia Yang
b6c16cf8ff
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352)
2024-07-11 21:30:46 -07:00
Lily Liu
d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI (#6351)
2024-07-12 01:32:06 +00:00
Robert Shaw
7ed6a4f0e1
[ BugFix ] Prompt Logprobs Detokenization (#6223)
...
Co-authored-by: Zifei Tong <zifeitong@gmail.com>
2024-07-11 22:02:29 +00:00
xwjiang2010
1df43de9bb
[bug fix] Fix llava next feature size calculation. (#6339)
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2024-07-11 17:21:10 +00:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities (#6082)
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
sroy745
ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765)
2024-07-10 16:02:47 -07:00
youkaichao
da78caecfa
[core][distributed] zmq fallback for broadcasting large objects (#6183)
2024-07-09 18:49:11 -07:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978)
2024-07-09 18:34:02 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts (#4645)
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
tomeras91
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding (#6214)
2024-07-08 11:25:51 -07:00
afeldman-nm
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor (#6110)
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Roger Wang
6206dcb29e
[Model] Add PaliGemma (#5189)
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
jvlunteren
f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API (#5742)
2024-07-05 10:37:09 -07:00