squall/vllm - vllm

Author	SHA1	Message	Date
Pooya Davoodi	8da48e4d95	[Frontend] Publish Prometheus metrics in run_batch API (#7641)	2024-08-23 23:04:22 -07:00
Pooya Davoodi	6885fde317	[Bugfix] Fix run_batch logger (#7640)	2024-08-23 13:58:26 -07:00
Alexander Matveev	9db93de20c	[Core] Add multi-step support to LLMEngine (#7789)	2024-08-23 12:45:53 -07:00
Simon Mo	09c7792610	Bump version to v0.5.5 (#7823)	2024-08-23 11:35:33 -07:00
Dipika Sikka	f1df5dbfd6	[Misc] Update `marlin` to use vLLMParameters (#7803)	2024-08-23 14:30:52 -04:00
youkaichao	35ee2ad6b9	[github][misc] promote asking llm first (#7809)	2024-08-23 09:38:50 -07:00
Maximilien de Bayser	e25fee57c2	[BugFix] Fix server crash on empty prompt (#7746) Signed-off-by: Max de Bayser <mbayser@br.ibm.com>	2024-08-23 13:12:44 +00:00
Jie Fu (傅杰)	faeddb565d	[misc] Add Torch profiler support for CPU-only devices (#7806)	2024-08-23 05:46:25 +00:00
Kunshang Ji	fc5ebbd1d3	[Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712)	2024-08-22 20:06:54 -07:00
SangBin Cho	c01a6cb231	[Ray backend] Better error when pg topology is bad. (#7584) Co-authored-by: youkaichao <youkaichao@126.com>	2024-08-22 17:44:25 -07:00
Joe Runde	b903e1ba7f	[Frontend] error suppression cleanup (#7786) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2024-08-22 21:50:21 +00:00
Siyuan Liu	a152246428	[Misc] fix typo in triton import warning (#7794)	2024-08-22 13:51:23 -07:00
Kevin H. Luu	666ad0aa16	[ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args (#7705) Signed-off-by: kevin <kevin@anyscale.com>	2024-08-22 20:10:55 +00:00
Michael Goin	15310b5101	[Bugfix] Use LoadFormat values for `vllm serve --load-format` (#7784)	2024-08-22 11:37:08 -07:00
Peter Salas	57792ed469	[Doc] Fix incorrect docs from #7615 (#7788)	2024-08-22 10:02:06 -07:00
Jiaxin Shan	d3b5b98021	[Misc] Enhance prefix-caching benchmark tool (#6568)	2024-08-22 09:32:02 -07:00
Travis Johnson	cc0eaf12b1	[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-08-22 09:33:48 -04:00
Dipika Sikka	955b5191c9	[Misc] update fp8 to use `vLLMParameter` (#7437)	2024-08-22 08:36:18 -04:00
Lucas Wilkinson	55d63b1211	[Bugfix] Don't build machete on cuda <12.0 (#7757)	2024-08-22 08:28:52 -04:00
Flex Wang	4f419c00a6	Fix ShardedStateLoader for vllm fp8 quantization (#7708)	2024-08-22 08:25:04 -04:00
Abhinav Goyal	a3fce56b88	[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830)	2024-08-22 02:42:24 -07:00
Woosuk Kwon	b3856bef7d	[Misc] Use torch.compile for GemmaRMSNorm (#7642)	2024-08-22 01:14:13 -07:00
youkaichao	8c6f694a79	[ci] refine dependency for distributed tests (#7776)	2024-08-22 00:54:15 -07:00
Woosuk Kwon	eeee1c3b1a	[TPU] Avoid initializing TPU runtime in is_tpu (#7763)	2024-08-21 21:31:49 -07:00
Michael Goin	aae74ef95c	Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764)	2024-08-22 03:42:14 +00:00
Joe Runde	cde9183b40	[Bug][Frontend] Improve ZMQ client robustness (#7443) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2024-08-22 02:18:11 +00:00
zifeitong	df1a21131d	[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710)	2024-08-22 09:36:24 +08:00
Luka Govedič	7937009a7e	[Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` (#7233) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-21 20:18:00 -04:00
Gregory Shtrasberg	9984605412	[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility (#7477) Co-authored-by: Charlie Fu <Charlie.Fu@amd.com>	2024-08-21 16:47:36 -07:00
youkaichao	7eebe8ccaa	[distributed][misc] error on same VLLM_HOST_IP setting (#7756)	2024-08-21 16:25:34 -07:00
Dipika Sikka	8678a69ab5	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-21 16:17:10 -07:00
William Lin	5844017285	[ci] [multi-step] narrow multi-step test dependency paths (#7760)	2024-08-21 15:52:40 -07:00
Peter Salas	1ca0d4f86b	[Model] Add UltravoxModel and UltravoxConfig (#7615)	2024-08-21 22:49:39 +00:00
William Lin	dd53c4b023	[misc] Add Torch profiler support (#7451) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-21 15:39:26 -07:00
Robert Shaw	970dfdc01d	[Frontend] Improve Startup Failure UX (#7716)	2024-08-21 19:53:01 +00:00
William Lin	91f4522cbf	[multi-step] Raise error if not using async engine (#7703)	2024-08-21 11:49:19 -07:00
sasha0552	1b32e02648	[Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730)	2024-08-21 11:17:48 -07:00
Robert Shaw	f7e3b0c5aa	[Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend (#7394) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-21 13:34:14 -04:00
Brian Li	d3c002eadc	[Bugfix] chat method add_generation_prompt param (#7734)	2024-08-21 17:33:35 +00:00
Nick Hill	9b73a2f498	[Spec Decoding] Use target model max length as default for draft model (#7706)	2024-08-22 00:23:22 +08:00
Isotr0py	6925cdbeea	[Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backend (#7735)	2024-08-21 16:23:03 +00:00
LI MOU	53328d7536	[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509)	2024-08-21 08:54:31 -07:00
Nick Hill	c75363fbc0	[BugFix] Avoid premature async generator exit and raise all exception variations (#7698)	2024-08-21 11:45:55 -04:00
sasha0552	dd3fa0e430	[Bugfix] Mirror jinja2 in pyproject.toml (#7723)	2024-08-21 13:41:17 +00:00
Cyrus Leung	baaedfdb2d	[mypy] Enable following imports for entrypoints (#7248) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Fei <dfdfcai4@gmail.com>	2024-08-20 23:28:21 -07:00
Roger Wang	4506641212	[Doc] Section for Multimodal Language Models (#7719)	2024-08-20 23:24:01 -07:00
Isotr0py	12e1c65bc9	[Model] Add AWQ quantization support for InternVL2 model (#7187)	2024-08-20 23:18:57 -07:00
youkaichao	b74a125800	[ci] try to log process using the port to debug the port usage (#7711)	2024-08-20 17:41:12 -07:00
Antoni Baum	66a9e713a7	[Core] Pipe `worker_class_fn` argument in Executor (#7707)	2024-08-21 00:37:39 +00:00
youkaichao	9e51b6a626	[ci][test] adjust max wait time for cpu offloading test (#7709)	2024-08-20 17:12:44 -07:00

2445 Commits