squall/vllm - vllm

Author	SHA1	Message	Date
Kaunil Dhruv	058344f89a	[Frontend]-config-cli-args (#7737 ) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>	2024-08-30 08:21:02 -07:00
Cyrus Leung	98cef6a227	[Core] Increase default `max_num_batched_tokens` for multimodal models (#8028 )	2024-08-30 08:20:34 -07:00
Jungho Christopher Cho	f97be32d1d	[VLM][Model] TP support for ViTs (#7186 ) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-08-30 08:19:27 -07:00
Cyrus Leung	afd39a4511	[Bugfix] Fix import error in Exaone model (#8034 )	2024-08-30 08:03:28 -07:00
Richard Liu	2148441fd3	[TPU] Support single and multi-host TPUs on GKE (#7613 )	2024-08-30 00:27:40 -07:00
Yohan Na	dc13e99348	[MODEL] add Exaone model support (#7819 )	2024-08-29 23:34:20 -07:00
Avshalom Manevich	34a0e96d46	[Kernel] changing fused moe kernel chunk size default to 32k (#7995 )	2024-08-30 04:11:39 +00:00
Woosuk Kwon	80c7b089b1	[TPU] Async output processing for TPU (#8011 )	2024-08-29 19:35:29 -07:00
afeldman-nm	428dd1445e	[Core] Logprobs support in Multi-step (#7652 )	2024-08-29 19:19:08 -07:00
Cyrus Leung	4abed65c58	[VLM] Disallow overflowing `max_model_len` for multimodal models (#7998 )	2024-08-29 17:49:04 -07:00
Wei-Sheng Chin	0c785d344d	Add more percentiles and latencies (#7759 )	2024-08-29 16:48:11 -07:00
chenqianfzh	4664ceaad6	support bitsandbytes 8-bit and FP4 quantized models (#7445 )	2024-08-29 19:09:08 -04:00
Harsha vardhan manoj Bikki	257afc37c5	[Neuron] Adding support for context-length, token-gen buckets. (#7885 ) Co-authored-by: Harsha Bikki <harbikh@amazon.com>	2024-08-29 13:58:14 -07:00
Dipika Sikka	86a677de42	[misc] update tpu int8 to use new vLLM Parameters (#7973 )	2024-08-29 16:46:55 -04:00
Isotr0py	d78789ac16	[Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism (#7954 )	2024-08-29 15:54:49 -04:00
kushanam	c334b1898b	extend cuda graph size for H200 (#7894 ) Co-authored-by: youkaichao <youkaichao@126.com>	2024-08-29 12:15:04 -07:00
Pavani Majety	6b3421567d	[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985 ) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-29 14:53:11 -04:00
Alexander Matveev	3f60f2244e	[Core] Combine async postprocessor and multi-step (#7921 )	2024-08-29 11:18:26 -07:00
Jonas M. Kübler	f205c09854	[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899 )	2024-08-28 22:18:13 -07:00
youkaichao	ef99a78760	Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982 )	2024-08-28 21:27:06 -07:00
Peter Salas	74d5543ec5	[VLM][Core] Fix exceptions on ragged NestedTensors (#7974 )	2024-08-29 03:24:31 +00:00
youkaichao	a7f65c2be9	[torch.compile] remove reset (#7975 )	2024-08-28 17:32:26 -07:00
Nick Hill	4289cad37f	[Frontend] Minor optimizations to zmq decoupled front-end (#7957 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-08-28 17:22:43 -07:00
Michael Goin	af59df0a10	Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961 )	2024-08-28 19:19:17 -04:00
youkaichao	ce6bf3a2cf	[torch.compile] avoid Dynamo guard evaluation overhead (#7898 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-08-28 16:10:12 -07:00
bnellnm	3cdfe1f38b	[Bugfix] Make torch registration of punica ops optional (#7970 )	2024-08-28 16:11:49 -06:00
Mor Zusman	fdd9daafa3	[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651 )	2024-08-28 15:06:52 -07:00
Stas Bekman	8c56e57def	[Doc] fix 404 link (#7966 )	2024-08-28 13:54:23 -07:00
Woosuk Kwon	eeffde1ac0	[TPU] Upgrade PyTorch XLA nightly (#7967 )	2024-08-28 13:10:21 -07:00
rasmith	e5697d161c	[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386 )	2024-08-28 15:37:47 -04:00
Pavani Majety	b98cc28f91	[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-08-28 10:01:22 -07:00
Cyrus Leung	ef9baee3c5	[Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948 )	2024-08-28 08:11:18 -07:00
Stas Bekman	98c12cffe5	[Doc] fix the autoAWQ example (#7937 )	2024-08-28 12:12:32 +00:00
youkaichao	f52a43a8b9	[ci][test] fix pp test failure (#7945 )	2024-08-28 01:27:07 -07:00
Cody Yu	e3580537a4	[Performance] Enable chunked prefill and prefix caching together (#7753 )	2024-08-28 00:36:31 -07:00
Alexander Matveev	f508e03e7f	[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911 )	2024-08-28 00:02:30 -07:00
Cyrus Leung	51f86bf487	[mypy][CI/Build] Fix mypy errors (#7929 )	2024-08-27 23:47:44 -07:00
bnellnm	c166e7e43e	[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886 )	2024-08-27 23:13:45 -04:00
youkaichao	bc6e42a9b1	[hardware][rocm] allow rocm to override default env var (#7926 )	2024-08-27 19:50:06 -07:00
Peter Salas	fab5f53e2d	[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902 )	2024-08-28 01:53:56 +00:00
Jonathan Berkhahn	9c71c97ae2	[mypy] Enable mypy type checking for `vllm/core` (#7229 )	2024-08-28 07:11:14 +08:00
zifeitong	5340a2dccf	[Model] Add multi-image input support for LLaVA-Next offline inference (#7230 )	2024-08-28 07:09:02 +08:00
Philipp Schmid	345be0e244	[benchmark] Update TGI version (#7917 )	2024-08-27 15:07:53 -07:00
Dipika Sikka	fc911880cc	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-27 15:07:09 -07:00
youkaichao	ed6f002d33	[cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924 )	2024-08-27 12:06:11 -07:00
Isotr0py	b09c755be8	[Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916 )	2024-08-27 17:36:09 +00:00
alexeykondrat	42e932c7d4	[CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237 )	2024-08-27 10:09:13 -07:00
Kunshang Ji	076169f603	[Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810 )	2024-08-27 10:07:02 -07:00
Isotr0py	9db642138b	[CI/Build][VLM] Cleanup multiple images inputs model test (#7897 )	2024-08-27 15:28:30 +00:00
Patrick von Platen	6fc4e6e07a	[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739 )	2024-08-27 12:40:02 +00:00

2666 Commits