squall/vllm - vllm

Author	SHA1	Message	Date
Antoni Baum	999ef0b917	[Misc] Add numpy implementation of `compute_slot_mapping` (#7377)	2024-08-09 22:52:29 +00:00
Dipika Sikka	5c6c54d67a	[Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models (#7376)	2024-08-09 21:23:46 +00:00
Mahesh Keralapura	933790c209	[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)	2024-08-09 13:55:13 -07:00
Roger Wang	70d268a399	[Bugfix] Fix ITL recording in serving benchmark (#7372)	2024-08-09 10:00:00 -07:00
Pooya Davoodi	249b88228d	[Frontend] Support embeddings in the run_batch API (#7132) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-08-09 09:48:21 -07:00
Alexander Matveev	74af2bbd90	[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360)	2024-08-09 16:35:49 +00:00
Alexander Matveev	fc7b8d1eef	[Performance] e2e overheads reduction: Small followup diff (#7364)	2024-08-09 15:49:36 +00:00
Isotr0py	67abdbb42f	[VLM][Doc] Add `stop_token_ids` to InternVL example (#7354)	2024-08-09 14:51:04 +00:00
Mor Zusman	07ab160741	[Model][Jamba] Mamba cache single buffer (#6739) Co-authored-by: Mor Zusman <morz@ai21.com>	2024-08-09 10:07:06 -04:00
Nick Hill	b4e9528f95	[Core] Streamline stream termination in `AsyncLLMEngine` (#7336)	2024-08-09 07:06:36 +00:00
William Lin	57b7be0e1c	[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971)	2024-08-09 05:42:45 +00:00
Travis Johnson	99b4cf5f23	[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-08-08 22:08:46 -07:00
Alexander Matveev	e02ac55617	[Performance] Optimize e2e overheads: Reduce python allocations (#7162)	2024-08-08 21:34:28 -07:00
Woosuk Kwon	73388c07a4	[TPU] Fix dockerfile.tpu (#7331)	2024-08-08 20:24:58 -07:00
Cyrus Leung	7eb4a51c5f	[Core] Support serving encoder/decoder models (#7258)	2024-08-09 10:39:41 +08:00
Siyuan Liu	0fa14907da	[TPU] Add Load-time W8A16 quantization for TPU Backend (#7005)	2024-08-08 18:35:49 -07:00
Simon Mo	5923532e15	Add Skywork AI as Sponsor (#7314)	2024-08-08 13:59:57 -07:00
Jee Jee Li	a049b107e2	[Misc] Temporarily resolve the error of BitAndBytes (#7308)	2024-08-08 13:42:58 -07:00
Isotr0py	8334c39f37	[Bugfix] Fix new Llama3.1 GGUF model loading (#7269)	2024-08-08 13:42:44 -07:00
Daniele	e904576743	[CI/Build] Dockerfile.cpu improvements (#7298)	2024-08-08 15:24:52 -04:00
Michael Goin	e14fb22e59	[Doc] Put collect_env issue output in a <detail> block (#7310)	2024-08-08 11:22:49 -07:00
Zach Zheng	782e53ab59	[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849)	2024-08-08 10:43:30 -07:00
Joe Runde	21b9c49aa3	[Frontend] Kill the server on engine death (#6594) Signed-off-by: Joe Runde <joe@joerun.de> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2024-08-08 09:47:48 -07:00
Luka Govedič	5fb4a3f678	[Bugfix][Kernel] Increased atol to fix failing tests (#7305)	2024-08-08 12:16:13 -04:00
Jee Jee Li	757ac70a64	[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 (#7273)	2024-08-08 14:02:41 +00:00
Murali Andoorveedu	6dffa4b0a6	[Bugfix] Fix LoRA with PP (#7292)	2024-08-08 00:02:27 -07:00
Cherilyn Buren	48abee9e54	[Frontend] remove max_num_batched_tokens limit for lora (#7288)	2024-08-08 06:17:29 +00:00
Rui Qiao	746709642c	[Misc] Fix typos in scheduler.py (#7285) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>	2024-08-07 17:06:01 -07:00
Lily Liu	e53dfd3eaf	[Kernel] Fix Flashinfer Correctness (#7284)	2024-08-07 16:26:52 -07:00
Michael Goin	6d94420246	[Doc] Update supported_hardware.rst (#7276)	2024-08-07 14:21:50 -07:00
Nick Hill	fc1493a01e	[FrontEnd] Make `merge_async_iterators` `is_cancelled` arg optional (#7282)	2024-08-07 13:35:14 -07:00
Lucas Wilkinson	311f743831	[Bugfix] Fix gptq failure on T4s (#7264)	2024-08-07 20:05:37 +00:00
Kevin H. Luu	469b3bc538	[ci] Make building wheels per commit optional (#7278) Signed-off-by: kevin <kevin@anyscale.com>	2024-08-07 11:34:25 -07:00
Michael Goin	5223199e03	[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219)	2024-08-07 11:23:12 -07:00
Maximilien de Bayser	fde47d3bc2	[BugFix] Fix frontend multiprocessing hang (#7217) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-08-07 18:09:36 +00:00
Stas Bekman	0e12cd67a8	[Doc] add online speculative decoding example (#7243)	2024-08-07 09:58:02 -07:00
Ilya Lavrenov	80cbe10c59	[OpenVINO] migrate to latest dependencies versions (#7251)	2024-08-07 09:49:10 -07:00
Isotr0py	b764547616	[Bugfix] Fix input processor for InternVL2 model (#7164) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-07 09:32:07 -07:00
Rafael Vasquez	ab0f5e2823	Fixes typo in function name (#7275) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>	2024-08-07 09:29:27 -07:00
Robert Shaw	564985729a	[ BugFix ] Move `zmq` frontend to IPC instead of TCP (#7222)	2024-08-07 16:24:56 +00:00
Dipika Sikka	0f7052bc7e	[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` (#5874)	2024-08-07 09:17:58 -07:00
youkaichao	639159b2a6	[distributed][misc] add specialized method for cuda platform (#7249)	2024-08-07 08:54:52 -07:00
Cyrus Leung	66d617e343	[Frontend] Gracefully handle missing chat template and fix CI failure (#7238) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-08-07 09:12:05 +00:00
Atilla Akkuş	7b261092de	[BUGFIX]: top_k is expected to be an integer. (#7227)	2024-08-07 00:32:16 -07:00
Roger Wang	2385c8f374	[Doc] Mock new dependencies for documentation (#7245)	2024-08-07 06:43:03 +00:00
Nick Hill	9a3f49ae07	[BugFix] Overhaul async request cancellation (#7111)	2024-08-07 13:21:41 +08:00
Michael Goin	f9a5600649	[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225)	2024-08-06 18:34:26 -07:00
afeldman-nm	fd95e026e0	[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) Co-authored-by: Andrew Feldman <afeld2012@gmail.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-06 16:51:47 -04:00
xiaobochen123	660470e5a3	[Core] Optimize evictor-v2 performance (#7193)	2024-08-06 12:34:25 -07:00
Luka Govedič	8d59dbb000	[Kernel] Add per-tensor and per-token AZP epilogues (#5941) Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-08-06 18:17:08 +00:00

2417 Commits