squall/vllm - vllm

Author	SHA1	Message	Date
zhaoyang-star	0650e5935b	Disable cuda version check in vllm-openai image (#4530)	2024-05-05 16:58:55 -07:00
DearPlanet	4302987069	[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937)	2024-05-04 15:39:34 -07:00
SangBin Cho	36fb68f947	[Doc] Chunked Prefill Documentation (#4580)	2024-05-04 00:18:00 -07:00
Lily Liu	43c413ec57	[Kernel] Use flashinfer for decoding (#4353) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>	2024-05-03 15:51:27 -07:00
SangBin Cho	3521ba4f25	[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518)	2024-05-03 10:20:12 -07:00
youkaichao	5b8a7c1cb0	[Misc] centralize all usage of environment variables (#4548)	2024-05-02 11:13:25 -07:00
leiwen83	b38e42fbca	[Speculative decoding] Add ngram prompt lookup decoding (#4237) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-05-01 11:13:03 -07:00
AnyISalIn	a88bb9b032	[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173) Signed-off-by: AnyISalIn <anyisalin@gmail.com>	2024-05-01 09:11:03 -07:00
Robert Shaw	73c8d677e5	[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) Co-authored-by: alexm <alexm@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-29 09:35:34 -07:00
Austin Veselka	eefeb16464	[Kernel] Full Tensor Parallelism for LoRA Layers (#3524) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-04-27 00:03:48 -07:00
SangBin Cho	a88081bf76	[CI] Disable non-lazy string operation on logging (#4326) Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>	2024-04-26 00:16:58 -07:00
Caio Mendes	96e90fdeb3	[Model] Adds Phi-3 support (#4298)	2024-04-25 03:06:57 +00:00
Cade Daniel	62b8aebc6f	[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951)	2024-04-23 08:02:36 +00:00
GeauxEric	a37d815b83	Make initialization of tokenizer and detokenizer optional (#3748) Co-authored-by: Yun Ding <yunding@nvidia.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-04-21 22:06:46 +00:00
Michael Goin	53b018edcb	[Bugfix] Get available quantization methods from quantization registry (#4098)	2024-04-18 00:21:55 -07:00
Antoni Baum	69e1d2fb69	[Core] Refactor model loading code (#4097)	2024-04-16 11:34:39 -07:00
Noam Gat	05434764cd	LM Format Enforcer Guided Decoding Support (#3868) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-16 05:54:57 +00:00
Sanger Steel	711a000255	[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476)	2024-04-13 17:13:01 -07:00
SangBin Cho	09473ee41c	[mypy] Add mypy type annotation part 1 (#4006)	2024-04-12 14:35:50 -07:00
Jee Li	1096717ae9	[Core] Support LoRA on quantized models (#4012)	2024-04-11 21:02:44 -07:00
SangBin Cho	67b4221a61	[Core][5/N] Fully working chunked prefill e2e (#3884)	2024-04-10 17:56:48 -07:00
Travis Johnson	934d3662f7	[Bugfix] handle hf_config with architectures == None (#3982) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-10 22:28:25 +00:00
Cade Daniel	e7c7067b45	[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837)	2024-04-09 11:44:15 -07:00
SangBin Cho	18de883489	[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853)	2024-04-05 10:17:58 -07:00
Adrian Abeyta	2ff767b513	Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com> Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu> Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com> Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com> Co-authored-by: guofangze <guofangze@kuaishou.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-03 14:15:55 -07:00
Cade Daniel	5757d90e26	[Speculative decoding] Adding configuration object for speculative decoding (#3706) Co-authored-by: Lily Liu <lilyliupku@gmail.com>	2024-04-03 00:40:57 +00:00
bigPYJ1151	0e3f06fe9c	[Hardware][Intel] Add CPU inference backend (#3634) Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>	2024-04-01 22:07:30 -07:00
Qubitium	7d4e1b85e7	[Misc] Add support for new autogptq checkpoint_format (#3689) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-04-01 19:32:01 -04:00
Cade Daniel	93deb0b38f	[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250)	2024-04-01 22:55:24 +00:00
Roger Wang	97356f3c7e	[Bugfix] Command-R Max Model Length (#3727)	2024-03-29 12:27:51 -07:00
SangBin Cho	b51c1cc9d2	[2/N] Chunked prefill data update (#3538)	2024-03-28 10:06:01 -07:00
Cade Daniel	14ccd94c89	[Core][Bugfix]Refactor block manager for better testability (#3492)	2024-03-27 23:59:28 -07:00
Megha Agarwal	e24336b5a7	[Model] Add support for DBRX (#3660)	2024-03-27 13:01:46 -07:00
xwjiang2010	64172a976c	[Feature] Add vision language model support. (#3042)	2024-03-25 14:16:30 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495)	2024-03-25 07:59:47 -07:00
Thomas Parnell	cf2f084d56	Dynamic scheduler delay to improve ITL performance (#3279) Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>	2024-03-22 12:28:14 -07:00
Hanzhi Zhou	f721096d48	[BugFix] Some fixes for custom allreduce kernels (#2760)	2024-03-21 23:02:58 -07:00
Zhuohan Li	e90fc21f2e	[Hardware][Neuron] Refactor neuron support (#3471)	2024-03-22 01:22:17 +00:00
SangBin Cho	6e435de766	[1/n][Chunked Prefill] Refactor input query shapes (#3236)	2024-03-20 14:46:05 -07:00
Nick Hill	7341c77d69	[BugFix] Avoid initializing CUDA too early (#3487)	2024-03-18 23:05:20 -07:00
Antoni Baum	fb96c1e98c	Asynchronous tokenization (#2879)	2024-03-15 23:37:01 +00:00
陈序	54be8a0be2	Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-03-14 13:56:57 -07:00
Bo-Wen Wang	b167109ba1	[Fix] Fix quantization="gptq" when using Marlin (#3319) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-03-12 22:51:42 -07:00
Zhuohan Li	4c922709b6	Add distributed model executor abstraction (#3191)	2024-03-11 11:03:45 -07:00
Zhuohan Li	2f8844ba08	Re-enable the 80 char line width limit (#3305)	2024-03-10 19:49:14 -07:00
Antoni Baum	22de45235c	Push logprob generation to LLMEngine (#3065) Co-authored-by: Avnish Narayan <avnish@anyscale.com>	2024-03-04 19:54:06 +00:00
Philipp Moritz	17c3103c56	Make it easy to profile workers with nsight (#3162) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>	2024-03-03 16:19:13 -08:00
Sage Moore	ce4f5a29fb	Add Automatic Prefix Caching (#2762) Co-authored-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-03-02 00:50:01 -08:00
cloudhan	baee28c46c	Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104)	2024-03-02 14:34:48 +08:00
Allen.Dou	29e70e3e88	allow user chose log level by --log-level instead of fixed 'info'. (#3109) Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com> Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-03-01 23:28:41 +00:00

115 Commits