squall/vllm - vllm

Author	SHA1	Message	Date
Michael Goin	9e0b558a09	[Misc] Support FP8 kv cache scales from compressed-tensors (#6528)	2024-07-23 04:11:50 +00:00
zhaotyer	e519ae097a	add tqdm when loading checkpoint shards (#6569) Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io> Co-authored-by: youkaichao <youkaichao@126.com>	2024-07-22 20:48:01 -07:00
youkaichao	7c2749a4fd	[misc] add start loading models for users information (#6670)	2024-07-22 20:08:02 -07:00
Woosuk Kwon	729171ae58	[Misc] Enable chunked prefill by default for long context models (#6666)	2024-07-22 20:03:13 -07:00
Cheng Li	c5e8330997	[Bugfix] Fix null `modules_to_not_convert` in FBGEMM Fp8 quantization (#6665)	2024-07-22 19:25:05 -07:00
Cody Yu	e0c15758b8	[Core] Modulize prepare input and attention metadata builder (#6596)	2024-07-23 00:45:24 +00:00
Woosuk Kwon	bdf5fd1386	[Misc] Remove deprecation warning for beam search (#6659)	2024-07-23 00:21:58 +00:00
youkaichao	5a96ee52a3	[ci][build] add back vim in docker (#6661)	2024-07-22 16:26:29 -07:00
Jiaxin Shan	42c7f66a38	[Core] Support dynamically loading Lora adapter from HuggingFace (#6234) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-07-22 15:42:40 -07:00
Kevin H. Luu	69d5ae38dc	[ci] Use different sccache bucket for CUDA 11.8 wheel build (#6656) Signed-off-by: kevin <kevin@anyscale.com>	2024-07-22 14:20:41 -07:00
Tyler Michael Smith	fea59c7712	[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649)	2024-07-22 14:08:30 -06:00
Cyrus Leung	739b61a348	[Frontend] Refactor prompt processing (#4028) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-22 10:13:53 -07:00
Jae-Won Chung	89c1c6a196	[Bugfix] Fix `vocab_size` field access in `llava_next.py` (#6624)	2024-07-22 05:02:51 +00:00
Woosuk Kwon	42de2cefcb	[Misc] Add a wrapper for torch.inference_mode (#6618)	2024-07-21 18:43:11 -07:00
Roger Wang	c9eef37f32	[Model] Initial Support for Chameleon (#5770)	2024-07-21 17:37:51 -07:00
Alexander Matveev	396d92d5e0	[Kernel][Core] Add AWQ support to the Marlin kernel (#6612)	2024-07-21 19:41:42 -04:00
Isotr0py	25e778aa16	[Model] Refactor and decouple phi3v image embedding (#6621)	2024-07-21 16:07:58 -07:00
Woosuk Kwon	b6df37f943	[Misc] Remove abused noqa (#6619)	2024-07-21 23:47:04 +08:00
sroy745	14f91fe67c	[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485)	2024-07-20 23:58:58 -07:00
Cyrus Leung	d7f4178dd9	[Frontend] Move chat utils (#6602) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-21 08:38:17 +08:00
Robert Shaw	082ecd80d5	[ Bugfix ] Fix AutoFP8 fp8 marlin (#6609)	2024-07-20 17:25:56 -06:00
Michael Goin	f952bbc8ff	[Misc] Fix input_scale typing in w8a8_utils.py (#6579)	2024-07-20 23:11:13 +00:00
Robert Shaw	9364f74eee	[ Kernel ] Enable `fp8-marlin` for `fbgemm-fp8` models (#6606)	2024-07-20 18:50:10 +00:00
Matt Wong	06d6c5fe9f	[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543)	2024-07-20 09:39:07 -07:00
Robert Shaw	683e3cb9c4	[ Misc ] `fbgemm` checkpoints (#6559)	2024-07-20 09:36:57 -07:00
Cyrus Leung	9042d68362	[Misc] Consolidate and optimize logic for building padded tensors (#6541)	2024-07-20 04:17:24 +00:00
Travis Johnson	3f8d42c81f	Pipeline Parallel: Guard for KeyErrors at request abort (#6587) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-07-19 19:18:19 -07:00
Antoni Baum	7bd82002ae	[Core] Allow specifying custom Executor (#6557)	2024-07-20 01:25:06 +00:00
Varun Sundar Rabindranath	2e26564259	[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>	2024-07-19 18:15:26 -07:00
youkaichao	e81522e879	[build] add ib in image for out-of-the-box infiniband support (#6599) [build] add ib so that multi-node support with infiniband can be supported out-of-the-box (#6599)	2024-07-19 17:16:57 -07:00
Murali Andoorveedu	45ceb85a0c	[Docs] Update PP docs (#6598)	2024-07-19 16:38:21 -07:00
Robert Shaw	4cc24f01b1	[ Kernel ] Enable Dynamic Per Token `fp8` (#6547)	2024-07-19 23:08:15 +00:00
youkaichao	07eb6f19f3	[bugfix][distributed] fix multi-node bug for shared memory (#6597)	2024-07-19 15:34:34 -07:00
Thomas Parnell	f0bbfaf917	[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578)	2024-07-19 14:01:03 -07:00
Simon Mo	30efe41532	[Docs] Update docs for wheel location (#6580)	2024-07-19 12:14:11 -07:00
Antoni Baum	9ed82e7074	[Misc] Small perf improvements (#6520)	2024-07-19 12:10:56 -07:00
Daniele	51f8aa90ad	[Bugfix][Frontend] remove duplicate init logger (#6581)	2024-07-19 10:16:27 -07:00
Thomas Parnell	a5314e8698	[Model] RowParallelLinear: pass bias to quant_method.apply (#6327) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-07-19 07:15:22 -06:00
Woo-Yeon Lee	a921e86392	[BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369)	2024-07-19 06:01:09 -07:00
Cyrus Leung	6366efc67b	[Bugfix][Frontend] Fix missing `/metrics` endpoint (#6463)	2024-07-19 03:55:13 +00:00
Robert Shaw	dbe5588554	[ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` (#6515)	2024-07-18 22:39:18 -04:00
Thomas Parnell	d4201e06d5	[Bugfix] Make spec. decode respect per-request seed. (#6034) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-07-18 19:22:08 -07:00
Nick Hill	b5672a112c	[Core] Multiprocessing Pipeline Parallel support (#6130) Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-18 19:15:52 -07:00
Simon Mo	c5df56f88b	Add support for a rope extension method (#6553)	2024-07-19 01:53:03 +00:00
Tyler Michael Smith	1689219ebf	[CI/Build] Build on Ubuntu 20.04 instead of 22.04 (#6517)	2024-07-18 17:29:25 -07:00
Tyler Michael Smith	4ffffccb7e	[Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552)	2024-07-18 23:52:22 +00:00
youkaichao	f53b8f0d05	[ci][test] add correctness test for cpu offloading (#6549)	2024-07-18 23:41:06 +00:00
Kevin H. Luu	2d4733ba2d	Fix PR comment bot (#6554) Signed-off-by: kevin <kevin@anyscale.com>	2024-07-18 14:48:29 -07:00
Michael Goin	15c6a079b1	[Model] Support Mistral-Nemo (#6548)	2024-07-18 20:31:50 +00:00
Kevin H. Luu	ecdb462c24	[ci] Reword Github bot comment (#6534)	2024-07-18 08:01:45 -07:00

2172 Commits