squall/vllm - vllm

Author	SHA1	Message	Date
leiwen83	4bb53e2dde	[BugFix] fix num_lookahead_slots missing in async executor (#4165) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-04-30 10:12:59 -07:00
Kunshang Ji	26f2fb5113	[Core]Refactor gptq_marlin ops (#4466)	2024-04-30 08:14:47 -04:00
Woosuk Kwon	fa32207842	[Bugfix][Kernel] Fix compute_type for MoE kernel (#4463)	2024-04-29 22:05:40 -07:00
Michael Goin	d627a3d837	[Misc] Upgrade to `torch==2.3.0` (#4454)	2024-04-29 20:05:47 -04:00
youkaichao	f4f921b7f1	[Core][Distributed] use cpu group to broadcast metadata in cpu (#4444)	2024-04-29 13:52:22 -07:00
Simon Mo	ac5ccf0156	[CI] hotfix: soft fail neuron test (#4458)	2024-04-29 19:50:01 +00:00
Robert Shaw	73c8d677e5	[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) Co-authored-by: alexm <alexm@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-29 09:35:34 -07:00
SangBin Cho	df29793dc7	[mypy][5/N] Support all typing on model executor (#4427)	2024-04-28 19:01:26 -07:00
Simon Mo	03dd7d52bf	[CI] clean docker cache for neuron (#4441)	2024-04-28 23:32:07 +00:00
Ronen Schaffer	bf480c5302	Add more Prometheus metrics (#2764) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-04-28 15:59:33 -07:00
DefTruth	9c7306ac11	[Misc] fix typo in llm_engine init logging (#4428)	2024-04-28 18:58:30 +08:00
Robert Shaw	4ea1f9678d	[BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418)	2024-04-27 18:35:33 +00:00
Nick Hill	ba4be44c32	[BugFix] Fix return type of executor execute_model methods (#4402)	2024-04-27 11:17:45 -07:00
Prashant Gupta	d6e520e170	[Core] Support offline use of local cache for models (#4374) Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>	2024-04-27 09:59:55 -07:00
Nick Hill	81661da7b2	[BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389) Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>	2024-04-27 09:52:46 -07:00
Ruoyu Qin	dfea173148	[Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363)	2024-04-27 09:48:37 -07:00
Roy	7134303cbb	[Bugfix][Core] Fix get decoding config from ray (#4335)	2024-04-27 11:30:08 +00:00
Caio Mendes	3da24c2df7	[Model] Phi-3 4k sliding window temp. fix (#4380)	2024-04-27 18:08:15 +08:00
Austin Veselka	eefeb16464	[Kernel] Full Tensor Parallelism for LoRA Layers (#3524) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-04-27 00:03:48 -07:00
Hongxia Yang	18d23f642a	[ROCm][Hardware][AMD] Enable group query attention for triton FA (#4406)	2024-04-26 23:37:40 -07:00
Roy	87f545ba6f	[Misc] Fix logger format typo (#4396)	2024-04-27 13:45:02 +08:00
Cyrus Leung	8947bc3c15	[Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355)	2024-04-27 05:08:24 +00:00
Philipp Moritz	12628d3c78	[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-27 04:49:59 +00:00
Nick Hill	258a2c58d0	[Core] Introduce `DistributedGPUExecutor` abstract class (#4348)	2024-04-27 04:14:26 +00:00
youkaichao	aba47be3fe	[Misc] add RFC issue template (#4401) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-26 15:47:45 -07:00
Cody Yu	a62aaf1df5	[Misc][Refactor] Generalize linear_method to be quant_method (#4373)	2024-04-26 16:41:14 -04:00
SangBin Cho	603ad84815	[Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309)	2024-04-26 13:02:02 +00:00
SangBin Cho	a88081bf76	[CI] Disable non-lazy string operation on logging (#4326) Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>	2024-04-26 00:16:58 -07:00
Norman Mu	2f30e7c72f	[Frontend] Add --log-level option to api server (#4377)	2024-04-26 05:36:01 +00:00
Cyrus Leung	a74dee9b62	[Bugfix] Fix parameter name in `get_tokenizer` (#4107)	2024-04-25 19:10:48 -07:00
Hongxia Yang	cf29b7eda4	[ROCm][Hardware][AMD][Doc] Documentation update for ROCm (#4376) Co-authored-by: WoosukKwon <woosuk.kwon@berkeley.edu>	2024-04-25 18:12:25 -07:00
Nick Hill	efffb63f58	[Core] Move function tracing setup to util function (#4352)	2024-04-25 16:45:12 -07:00
Nick Hill	15e7c675b0	[Core] Add `shutdown()` method to `ExecutorBase` (#4349)	2024-04-25 16:32:48 -07:00
Roy	b6dcb4d442	[Misc] Fix flash attention backend log (#4368)	2024-04-25 12:43:32 -07:00
SangBin Cho	b5b4a398a7	[Mypy] Typing lora folder (#4337)	2024-04-25 19:13:50 +00:00
Kunshang Ji	f4bc4de1b1	[Core]refactor aqlm quant ops (#4351)	2024-04-25 15:03:56 -04:00
Caio Mendes	bd7a8eef25	[Doc] README Phi-3 name fix. (#4372) Co-authored-by: Caio Mendes <caiocesart@microsoft.com>	2024-04-25 10:32:00 -07:00
Alexei-V-Ivanov-AMD	7ee82bef1e	[CI/Build] Adding functionality to reset the node's GPUs before processing. (#4213)	2024-04-25 09:37:20 -07:00
Isotr0py	fbf152d976	[Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (#4324) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-25 09:35:56 -07:00
Nick Hill	479d69fad0	[Core] Move ray_utils.py from `engine` to `executor` package (#4347)	2024-04-25 06:52:22 +00:00
Caio Mendes	96e90fdeb3	[Model] Adds Phi-3 support (#4298)	2024-04-25 03:06:57 +00:00
zifeitong	a395a638c2	[Misc] Use public API in benchmark_throughput (#4300)	2024-04-24 21:10:24 +00:00
youkaichao	2768884ac4	[Doc] Add note for docker user (#4340) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-24 21:09:44 +00:00
alexm-nm	aae08249ac	[Bugfix] Fix marlin kernel crash on H100 (#4218) This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187. The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.	2024-04-24 10:35:01 -07:00
Roger Wang	7923dcad12	[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279)	2024-04-24 09:49:13 -07:00
youkaichao	3cd9b5bb2d	[Core][Distributed] use existing torch.cuda.device (#4318) [Core][Distributed] use existing torch.cuda.device context manager (#4318)	2024-04-24 09:00:20 -07:00
Woosuk Kwon	468d761b32	[Misc] Reduce supported Punica dtypes (#4304)	2024-04-23 18:54:33 -07:00
youkaichao	e4bf860a54	[CI][Build] change pynvml to nvidia-ml-py (#4302)	2024-04-23 18:33:12 -07:00
youkaichao	91f50a6fe2	[Core][Distributed] use cpu/gloo to initialize pynccl (#4248)	2024-04-23 18:32:19 -07:00
Robert Shaw	79a268c4ab	[BUG] fixed fp8 conflict with aqlm (#4307) Fixes fp8 interface which broke in AQLM merge.	2024-04-23 18:26:33 -07:00

1218 Commits