Commit Graph

790 Commits

Cody Yu
e3580537a4
[Performance] Enable chunked prefill and prefix caching together (#7753) 2024-08-28 00:36:31 -07:00
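A minimal usage sketch of what the commit above enables, assuming the pre-existing `enable_chunked_prefill` and `enable_prefix_caching` engine arguments can now be set together; the model name and prompt are placeholders, not taken from the commit.

```python
from vllm import LLM, SamplingParams

# Placeholder model; both flags are existing engine arguments that this
# change allows to be active at the same time.
llm = LLM(
    model="facebook/opt-125m",
    enable_chunked_prefill=True,   # schedule long prompts in chunks
    enable_prefix_caching=True,    # reuse KV-cache blocks across shared prefixes
)

outputs = llm.generate(["San Francisco is a"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```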
Cyrus Leung
51f86bf487
[mypy][CI/Build] Fix mypy errors (#7929) 2024-08-27 23:47:44 -07:00
Peter Salas
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) 2024-08-28 01:53:56 +00:00
zifeitong
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference (#7230) 2024-08-28 07:09:02 +08:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
Isotr0py
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897) 2024-08-27 15:28:30 +00:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) 2024-08-27 12:40:02 +00:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling (#7796) 2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Dipika Sikka
665304092d
[Misc] Update qqq to use vLLMParameters (#7805) 2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) 2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
Nick Hill
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851) 2024-08-25 15:45:14 -07:00
Isotr0py
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test (#7835) 2024-08-25 15:53:09 +00:00
Isotr0py
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783) 2024-08-25 11:51:20 +00:00
zifeitong
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840) 2024-08-24 18:16:24 -07:00
youkaichao
aab0fcdb63
[ci][test] fix RemoteOpenAIServer (#7838) 2024-08-24 17:31:28 +00:00
youkaichao
ea9fa160e3
[ci][test] exclude model download time in server start time (#7834) 2024-08-24 01:03:27 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol (#7654) 2024-08-23 23:07:24 -07:00
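A hedged client-side sketch of the OpenAI-style `response_format` payload the commit above wires up; the server URL, model name, and schema are placeholders, and the field names follow the OpenAI protocol rather than anything stated in the commit itself.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    # OpenAI-protocol structured-output request constrained by a JSON Schema.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": person_schema},
    },
)
print(resp.choices[0].message.content)
```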
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
Dipika Sikka
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters (#7803) 2024-08-23 14:30:52 -04:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. (#7584)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Dipika Sikka
955b5191c9
[Misc] update fp8 to use vLLMParameter (#7437) 2024-08-22 08:36:18 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) 2024-08-22 02:42:24 -07:00
Michael Goin
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) 2024-08-22 03:42:14 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710) 2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig (#7615) 2024-08-21 22:49:39 +00:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX (#7716) 2024-08-21 19:53:01 +00:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) 2024-08-21 08:54:31 -07:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations (#7698) 2024-08-21 11:45:55 -04:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints (#7248)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model (#7187) 2024-08-20 23:18:57 -07:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test (#7709) 2024-08-20 17:12:44 -07:00
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction (#7663) 2024-08-20 18:50:45 +00:00
Isotr0py
aae6927be0
[VLM][Model] Add test for InternViT vision encoder (#7409) 2024-08-20 23:10:20 +08:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) 2024-08-20 07:09:33 -06:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) 2024-08-19 17:58:14 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization (#7520) 2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling (#7000)
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics (#7606) 2024-08-19 11:52:07 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available (#7137)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-19 03:24:03 +00:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109) 2024-08-18 17:57:20 -07:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models (#7475)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-08-19 07:43:46 +08:00
Robert Shaw
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend (#7279)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling (#7530) 2024-08-17 13:30:55 -07:00
youkaichao
832163b875
[ci][test] allow longer wait time for api server (#7629) 2024-08-17 11:26:38 -07:00
youkaichao
5bf45db7df
[ci][test] fix engine/logger test (#7621) 2024-08-16 23:00:59 -07:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests (#7600) 2024-08-16 20:49:30 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 (#7440)
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
2024-08-16 13:46:01 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) 2024-08-16 10:06:30 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00
jon-chuang
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close (#7324) 2024-08-16 04:24:04 +00:00
Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck (#7574) 2024-08-15 21:16:20 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm (#5049)
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
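A minimal sketch of the offline chat entry point added in the commit above; the model name and messages are placeholders, and the return value is assumed to mirror `LLM.generate`.

```python
from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does a KV cache store?"},
]

# chat() applies the model's chat template before generating.
outputs = llm.chat(messages)
print(outputs[0].outputs[0].text)
```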
youkaichao
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail (#7572) 2024-08-15 19:39:04 -07:00
Grant Pinkert
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453) 2024-08-16 02:38:08 +00:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. (#7343) 2024-08-15 19:34:28 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse (#7521) 2024-08-14 15:07:37 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) 2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray (#7424)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins (#7489) 2024-08-13 19:27:46 -07:00
Jee Jee Li
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests (#7396) 2024-08-13 17:27:29 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation (#7426) 2024-08-13 16:24:17 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse (#7277) 2024-08-13 19:05:15 -04:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process (#7484) 2024-08-13 15:40:17 -07:00
Dipika Sikka
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters (#7422) 2024-08-13 17:08:20 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters (#7281) 2024-08-13 14:30:11 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446) 2024-08-13 17:39:33 +00:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) 2024-08-13 05:33:41 +00:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message (#7435) 2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements (#7417) 2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit (#7224)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input (#6613) 2024-08-12 16:16:06 +08:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392) 2024-08-10 16:19:33 +00:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 (#7380) 2024-08-09 23:48:49 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089) 2024-08-09 13:55:13 -07:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine (#7336) 2024-08-09 07:06:36 +00:00
William Lin
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) 2024-08-09 05:42:45 +00:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models (#7258) 2024-08-09 10:39:41 +08:00
Zach Zheng
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) 2024-08-08 10:43:30 -07:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death (#6594)
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Luka Govedič
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests (#7305) 2024-08-08 12:16:13 -04:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) 2024-08-07 11:23:12 -07:00
Maximilien de Bayser
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang (#7217)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-08-07 18:09:36 +00:00
Isotr0py
b764547616
[Bugfix] Fix input processor for InternVL2 model (#7164)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-07 09:32:07 -07:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874) 2024-08-07 09:17:58 -07:00
Cyrus Leung
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure (#7238)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-07 09:12:05 +00:00
Nick Hill
9a3f49ae07
[BugFix] Overhaul async request cancellation (#7111) 2024-08-07 13:21:41 +08:00
Michael Goin
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225) 2024-08-06 18:34:26 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lily Liu
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests (#7183) 2024-08-06 10:40:32 -07:00
Cyrus Leung
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-08-06 16:55:31 +08:00
Jee Jee Li
9118217f58
[LoRA] Relax LoRA condition (#7146) 2024-08-06 01:57:25 +00:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
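A hedged sketch of loading a single-file GGUF checkpoint per the commit above; the local path and tokenizer repo are placeholders, with the tokenizer assumed to come from the original Hugging Face model.

```python
from vllm import LLM

llm = LLM(
    model="/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",   # hypothetical local GGUF file
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",         # tokenizer from the base HF repo
)

print(llm.generate(["GGUF stands for"])[0].outputs[0].text)
```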
youkaichao
dfb1a15dcb
[ci][frontend] deduplicate tests (#7101) 2024-08-05 15:59:22 -07:00
Cade Daniel
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) 2024-08-05 08:46:44 +00:00
Alphi
7b86e7c9cd
[Model] Add multi-image support for minicpmv (#7122)
Co-authored-by: hezhihui <hzh7269@modelbest.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-05 09:23:17 +08:00
Yihuan Bu
654bc5ca49
Support for guided decoding for offline LLM (#6878)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-04 03:12:09 +00:00
youkaichao
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test (#7114) 2024-08-03 10:44:53 -07:00
Jee Jee Li
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA (#7081) 2024-08-02 22:40:19 -07:00
Zach Zheng
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits (#7018) 2024-08-02 22:38:15 -07:00
youkaichao
a0d164567c
[ci][distributed] disable ray dag tests (#7099) 2024-08-02 22:32:04 -07:00
youkaichao
04e5583425
[ci][distributed] merge distributed test commands (#7097)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-02 21:33:53 -07:00
youkaichao
69ea15e5cc
[ci][distributed] shorten wait time if server hangs (#7098) 2024-08-02 21:05:16 -07:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq (#6883)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
Rui Qiao
05308891e2
[Core] Pipeline parallel with Ray ADAG (#6837)
Support pipeline-parallelism with Ray accelerated DAG.

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 13:55:40 -07:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py (#7082) 2024-08-02 10:03:24 -07:00
youkaichao
252357793d
[ci][distributed] try to fix pp test (#7054) 2024-08-01 22:03:12 -07:00
Woosuk Kwon
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn (#7022) 2024-08-01 13:14:37 -07:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing (#7037) 2024-08-01 12:00:51 -07:00
youkaichao
c8a7e93273
[core][scheduler] simplify and improve scheduler (#6867) 2024-07-31 23:51:09 -07:00
zifeitong
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) 2024-07-31 21:13:34 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
Michael Goin
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization (#6960) 2024-07-31 12:47:46 -07:00
Cody Yu
bd70013407
[MISC] Introduce pipeline parallelism partition strategies (#6920)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-31 12:02:17 -07:00
Cyrus Leung
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982) 2024-07-31 23:46:17 +08:00
HandH1998
6512937de1
Support W4A8 quantization for vllm (#5218) 2024-07-31 07:55:21 -06:00
Cyrus Leung
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836) 2024-07-31 10:38:45 +08:00
Tyler Michael Smith
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) 2024-07-30 16:37:01 -04:00
Sanger Steel
052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing (#6881) 2024-07-30 11:48:50 -07:00
Nick Hill
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel (#6698) 2024-07-30 10:40:08 -07:00
Varun Sundar Rabindranath
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace (#6848)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
Nick Hill
9f69d8245a
[Frontend] New allowed_token_ids decoding request parameter (#6753) 2024-07-29 23:37:27 +00:00
Thomas Parnell
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. (#6786)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-29 14:51:27 -07:00
Peng Guanwen
db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None (#6532) 2024-07-29 16:47:31 +00:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Isotr0py
7cbd9ec7a9
[Model] Initialize support for InternVL2 series models (#6514)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-29 10:16:30 +00:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
Cyrus Leung
1ad86acf17
[Model] Initial support for BLIP-2 (#5920)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-07-27 11:53:07 +00:00
Joe
14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Sanger Steel
969d032265
[Bugfix]: Fix Tensorizer test failures (#6835) 2024-07-26 20:02:25 -07:00
youkaichao
443c7cf4cf
[ci][distributed] fix flaky tests (#6806) 2024-07-25 17:44:09 -07:00
Michael Goin
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761) 2024-07-25 09:46:15 -07:00