squall/vllm

Author	SHA1	Message	Date
Matt Wong	06d6c5fe9f	[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543 )	2024-07-20 09:39:07 -07:00
Robert Shaw	683e3cb9c4	[ Misc ] `fbgemm` checkpoints (#6559 )	2024-07-20 09:36:57 -07:00
Robert Shaw	4cc24f01b1	[ Kernel ] Enable Dynamic Per Token `fp8` (#6547 )	2024-07-19 23:08:15 +00:00
Robert Shaw	dbe5588554	[ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` (#6515 )	2024-07-18 22:39:18 -04:00
Nick Hill	b5672a112c	[Core] Multiprocessing Pipeline Parallel support (#6130 ) Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-18 19:15:52 -07:00
Tyler Michael Smith	1689219ebf	[CI/Build] Build on Ubuntu 20.04 instead of 22.04 (#6517 )	2024-07-18 17:29:25 -07:00
youkaichao	f53b8f0d05	[ci][test] add correctness test for cpu offloading (#6549 )	2024-07-18 23:41:06 +00:00
Rui Qiao	61e592747c	[Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032 ) Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>	2024-07-17 22:27:09 -07:00
youkaichao	1c27d25fb5	[core][model] yet another cpu offload implementation (#6496 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-07-17 20:54:35 -07:00
youkaichao	09c2eb85dd	[ci][distributed] add pipeline parallel correctness test (#6410 )	2024-07-16 15:44:22 -07:00
Cyrus Leung	d97011512e	[CI/Build] vLLM cache directory for images (#6444 )	2024-07-15 23:12:25 -07:00
Woosuk Kwon	4552e37b55	[CI/Build][TPU] Add TPU CI test (#6277 ) Co-authored-by: kevin <kevin@anyscale.com>	2024-07-15 14:31:16 -07:00
youkaichao	69672f116c	[core][distributed] simplify code to support pipeline parallel (#6406 )	2024-07-14 21:20:51 -07:00
Robert Shaw	a754dc2cb9	[CI/Build] Cross python wheel (#6394 )	2024-07-14 18:54:46 -07:00
Robert Shaw	73030b7dae	[ Misc ] Enable Quantizing All Layers of DeekSeekv2 (#6423 )	2024-07-14 21:38:42 +00:00
youkaichao	ccd3c04571	[ci][build] fix commit id (#6420 ) Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2024-07-14 22:16:21 +08:00
Tyler Michael Smith	9dad5cc859	[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace (#6384 )	2024-07-14 13:37:19 +00:00
Robert Shaw	fb6af8bc08	[ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 (#6417 )	2024-07-13 20:03:58 -07:00
Robert Shaw	babf52dade	[ Misc ] More Cleanup of Marlin (#6359 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-07-13 10:21:37 +00:00
youkaichao	41708e5034	[ci] try to add multi-node tests (#6280 ) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai> Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-12 21:51:48 -07:00
Simon Mo	6bc9710f6e	Fix release pipeline's dir permission (#6391 )	2024-07-12 15:52:43 -07:00
Simon Mo	21b2dcedab	Fix release pipeline's -e flag (#6390 )	2024-07-12 14:08:04 -07:00
Simon Mo	07b35af86d	Fix interpolation in release pipeline (#6389 )	2024-07-12 14:03:39 -07:00
Simon Mo	bb1a784b05	Fix release-pipeline.yaml (#6388 )	2024-07-12 14:00:57 -07:00
Simon Mo	d719ba24c5	Build some nightly wheels by default (#6380 )	2024-07-12 13:56:59 -07:00
Kevin H. Luu	b75bce1008	[ci] Add grouped tests & mark tests to run by default for fastcheck pipeline (#6365 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-07-12 09:58:38 -07:00
Alexei-V-Ivanov-AMD	f9d25c2519	[Build/CI] Checking/Waiting for the GPU's clean state (#6379 )	2024-07-12 09:42:24 -07:00
Robert Shaw	aea19f0989	[ Misc ] Support Models With Bias in `compressed-tensors` integration (#6356 )	2024-07-12 11:11:29 -04:00
adityagoel14	d26a8b3f1f	[CI/Build] (2/2) Switching AMD CI to store images in Docker Hub (#6350 )	2024-07-11 21:26:26 -07:00
Lily Liu	d6ab528997	[Misc] Remove flashinfer warning, add flashinfer tests to CI (#6351 )	2024-07-12 01:32:06 +00:00
Robert Shaw	7ed6a4f0e1	[ BugFix ] Prompt Logprobs Detokenization (#6223 ) Co-authored-by: Zifei Tong <zifeitong@gmail.com>	2024-07-11 22:02:29 +00:00
Kuntai Du	a4feba929b	[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362 )	2024-07-11 13:28:38 -07:00
Simon Mo	52b7fcb35a	Benchmark: add H100 suite (#6047 )	2024-07-11 09:17:07 -07:00
Kevin H. Luu	a0550cbc80	Add support for multi-node on CI (#5955 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-07-09 12:56:56 -07:00
Robert Shaw	abfe705a02	[ Misc ] Support Fp8 via `llm-compressor` (#6110 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-07-07 20:42:11 +00:00
Simon Mo	bc96d5c330	Move release wheel env var to Dockerfile instead (#6163 )	2024-07-05 17:19:53 -07:00
Simon Mo	f0250620dd	Fix release wheel build env var (#6162 )	2024-07-05 16:24:31 -07:00
Simon Mo	2de490d60f	Update wheel builds to strip debug (#6161 )	2024-07-05 14:51:25 -07:00
Lily Liu	69ec3ca14c	[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-07-04 16:35:51 -07:00
Yuan	81d7a50f24	[Hardware][Intel CPU] Adding intel openmp tunings in Docker file (#6008 ) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>	2024-07-04 15:22:12 -07:00
youkaichao	3de6e6a30e	[core][distributed] support n layers % pp size != 0 (#6115 )	2024-07-03 16:40:31 -07:00
Mor Zusman	9d6a8daa87	[Model] Jamba support (#4115 ) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai> Co-authored-by: Erez Schwartz <erezs@ai21.com> Co-authored-by: Mor Zusman <morz@ai21.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: Tomer Asida <tomera@ai21.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-02 23:11:29 +00:00
Robert Shaw	7c008c51a9	[ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-07-02 21:54:35 +00:00
Murali Andoorveedu	c5832d2ae9	[Core] Pipeline Parallel Support (#4412 ) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-02 10:58:08 -07:00
xwjiang2010	98d6682cd1	[VLM] Remove `image_input_type` from VLM config (#5852 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-02 07:57:09 +00:00
Robert Shaw	d76084c12f	[ CI ] Re-enable Large Model LM Eval (#6031 )	2024-07-01 12:40:45 -04:00
Robert Shaw	deacb7ec44	[ CI ] Temporarily Disable Large LM-Eval Tests (#6005 ) Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>	2024-06-30 11:56:56 -07:00
SangBin Cho	f5e73c9f1b	[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909 ) Co-authored-by: sang <sangcho@anyscale.com>	2024-06-30 17:11:15 +00:00
youkaichao	2be6955a3f	[ci][distributed] fix device count call [ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991)	2024-06-30 08:06:13 +00:00
Cyrus Leung	9d47f64eb6	[CI/Build] [3/3] Reorganize entrypoints tests (#5966 )	2024-06-30 12:58:49 +08:00
Roger Wang	bcc6a09b63	[CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989 )	2024-06-30 09:18:31 +08:00
Robert Shaw	75aa1442db	[ CI/Build ] LM Eval Harness Based CI Testing (#5838 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-29 13:04:30 -04:00
Cyrus Leung	99397da534	[CI/Build] Add TP test for vision models (#5892 )	2024-06-29 15:45:54 +00:00
Lily Liu	7041de4384	[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628 ) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>	2024-06-28 15:28:49 -07:00
Ilya Lavrenov	57f09a419c	[Hardware][Intel] OpenVINO vLLM backend (#5379 )	2024-06-28 13:50:16 +00:00
xwjiang2010	d12af207d2	[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly (#5880 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2024-06-27 15:15:24 +08:00
Woo-Yeon Lee	2ce5d6688b	[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414 )	2024-06-25 09:56:06 +00:00
Kevin H. Luu	e9de9dd551	[ci] Remove aws template (#5757 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-24 21:09:02 -07:00
Kunshang Ji	cf90ae0123	[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616 )	2024-06-21 17:09:34 -07:00
youkaichao	7187507301	[ci][test] fix ca test in main (#5746 )	2024-06-21 14:04:26 -07:00
youkaichao	d9a252bc8e	[Core][Distributed] add shm broadcast (#5399 ) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-06-21 05:12:35 +00:00
youkaichao	6c5b7af152	[distributed][misc] use fork by default for mp (#5669 )	2024-06-20 17:06:34 -07:00
Kevin H. Luu	949e49a685	[ci] Limit num gpus if specified for A100 (#5694 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-19 16:30:03 -07:00
youkaichao	d571ca0108	[ci][distributed] add tests for custom allreduce (#5689 )	2024-06-19 20:16:04 +00:00
Kevin H. Luu	3ee5c4bca5	[ci] Add A100 queue into AWS CI template (#5648 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-19 08:42:13 -06:00
Kevin H. Luu	19091efc44	[ci] Setup Release pipeline and build release wheels with cache (#5610 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-18 11:00:36 -07:00
Ronen Schaffer	7879f24dcc	[Misc] Add OpenTelemetry support (#4687 ) This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities. I've also added a markdown with print-screens to guide users how to use this feature.	2024-06-19 01:17:03 +09:00
Kevin H. Luu	13db4369d9	[ci] Deprecate original CI template (#5624 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-18 14:26:20 +00:00
Roger Wang	4ad7b53e59	[CI/Build][Misc] Update Pytest Marker for VLMs (#5623 )	2024-06-18 13:10:04 +00:00
Kuntai Du	114d7270ff	[CI] Avoid naming different metrics with the same name in performance benchmark (#5615 )	2024-06-17 21:37:18 -07:00
Cyrus Leung	32c86e494a	[Misc] Fix typo (#5618 )	2024-06-17 20:58:30 -07:00
Kuntai Du	9e4e6fe207	[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571)	2024-06-17 11:41:08 -07:00
Jie Fu (傅杰)	ab66536dbf	[CI/BUILD] Support non-AVX512 vLLM building and testing (#5574 )	2024-06-17 14:36:10 -04:00
Kunshang Ji	728c4c8a06	[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814 ) Co-authored-by: Jiang Li <jiang1.li@intel.com> Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com> Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-06-17 11:01:25 -07:00
Antoni Baum	f31c1f90e3	Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518 )	2024-06-16 07:48:02 +00:00
Simon Mo	bd7efe95d0	Add ccache to amd (#5555 )	2024-06-14 17:18:22 -07:00
Cyrus Leung	d47af2bc02	[CI/Build] Disable LLaVA-NeXT CPU test (#5529 )	2024-06-14 09:27:30 -07:00
Kuntai Du	319ad7f1d3	[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (#5073 ) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-06-13 22:36:20 -07:00
Antoni Baum	50eed24d25	Add `cuda_device_count_stateless` (#5473 )	2024-06-13 16:06:49 -07:00
Kevin H. Luu	916d219d62	[ci] Use sccache to build images (#5419 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-12 17:58:12 -07:00
Kevin H. Luu	8b82a89997	[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests (#5464 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-12 14:00:18 -07:00
youkaichao	c4bd03c7c5	[Core][Distributed] add same-node detection (#5369 )	2024-06-11 10:53:59 -07:00
Kevin H. Luu	76477a93b7	[ci] Fix Buildkite agent path (#5392 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-10 18:58:07 -07:00
Kevin H. Luu	c5602f0baa	[ci] Mount buildkite agent on Docker container to upload benchmark results (#5330 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-10 09:22:34 -07:00
Kevin H. Luu	f7f9c5f97b	[ci] Use small_cpu_queue for doc build (#5331 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-10 09:21:11 -07:00
Antoni Baum	ccdc490dda	[Core] Change LoRA embedding sharding to support loading methods (#5038 )	2024-06-06 19:07:57 -07:00
Cyrus Leung	89c920785f	[CI/Build] Update vision tests (#5307 )	2024-06-06 05:17:18 -05:00
Simon Mo	3a6ae1d33c	[CI] Disable flash_attn backend for spec decode (#5286 )	2024-06-05 15:49:27 -07:00
Tyler Michael Smith	02cc3b51a7	[misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263 )	2024-06-05 10:17:51 -07:00
Simon Mo	d5b1eb081e	[CI] Add nightly benchmarks (#5260 )	2024-06-05 09:42:08 -07:00
Simon Mo	9ca62d8668	[CI] mark AMD test as softfail to prevent blockage (#5256 )	2024-06-04 11:34:53 -07:00
Li, Jiang	45c35f0d58	[CI/Build] Reducing CPU CI execution time (#5241 )	2024-06-04 10:26:40 -07:00
Cyrus Leung	ec784b2526	[CI/Build] Add inputs tests (#5215 )	2024-06-03 21:01:46 -07:00
Kevin H. Luu	4f0d17c05c	New CI template on AWS stack (#5110 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-03 16:16:43 -07:00
Yuan	cafb8e06c5	[CI/BUILD] enable intel queue for longer CPU tests (#4113 )	2024-06-03 10:39:50 -07:00
youkaichao	f758505c73	[CI/Build] increase wheel size limit to 200 MB (#5130 )	2024-05-30 06:29:48 -07:00
omkar kakarparthi	e07aff9e52	[CI/Build] Docker cleanup functionality for amd servers (#5112 ) Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com> Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com> Co-authored-by: Alexei V. Ivanov <alexei.ivanov@amd.com> Co-authored-by: omkarkakarparthi <okakarpa>	2024-05-30 03:27:39 +00:00
youkaichao	5bd3c65072	[Core][Optimization] remove vllm-nccl (#5091 )	2024-05-29 05:13:52 +00:00
Cyrus Leung	5ae5ed1e60	[Core] Consolidate prompt arguments to LLM engines (#4328 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-05-28 13:29:31 -07:00
Letian Li	2ba80bed27	[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined (#5009 )	2024-05-23 09:08:58 -07:00

214 Commits