Author | Commit | Subject | Date
zhaotyer | c2e00af523 | [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable (#3955) (Co-authored-by: tianyi_zhao <tianyi.zhao@transwarp.io>) | 2024-04-10 04:49:11 +00:00
Zedong Peng | c013d32c75 | [Benchmark] Add cpu options to bench scripts (#3915) | 2024-04-09 21:30:03 -07:00
Jee Li | 11dd6ebb89 | [Misc] Avoid loading incorrect LoRA config (#3777) | 2024-04-09 19:47:15 -07:00
Juan Villamizar | 6c0b04515f | [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm (#3643) (Co-authored-by: jpvillam <jpvillam@amd.com>, Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>, Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2024-04-09 15:10:47 -07:00
Junichi Sato | e23a43aef8 | [Bugfix] Fix KeyError on loading GPT-NeoX (#3925) | 2024-04-09 12:11:31 -07:00
Cade Daniel | e7c7067b45 | [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) | 2024-04-09 11:44:15 -07:00
youkaichao | 6d592eb430 | [Core] separate distributed_init from worker (#3904) | 2024-04-09 08:49:02 +00:00
Roy | d036198e23 | [BugFix][Model] Fix commandr RoPE max_position_embeddings (#3919) | 2024-04-09 06:17:21 +08:00
Matt Wong | 59a6abf3c9 | [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) | 2024-04-08 14:31:02 -07:00
Kiran R | bc0c0192d1 | [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration (#3767) (Co-authored-by: roy <jasonailu87@gmail.com>) | 2024-04-08 19:42:35 +00:00
egortolmachev | f46864d68d | [Bugfix] Added Command-R GPTQ support (#3849) (Co-authored-by: Egor Tolmachev <t333ga@gmail.com>) | 2024-04-08 14:59:38 +00:00
ywfang | b4543c8f6b | [Model] add minicpm (#3893) | 2024-04-08 18:28:36 +08:00
Isotr0py | 0ce0539d47 | [Bugfix] Fix Llava inference with Tensor Parallelism. (#3883) | 2024-04-07 22:54:13 +08:00
youkaichao | 2f19283549 | [Core] latency optimization (#3890) | 2024-04-06 19:14:06 -07:00
youkaichao | 95baec828f | [Core] enable out-of-tree model register (#3871) | 2024-04-06 17:11:41 -07:00
youkaichao | e4be7d70bb | [CI/Benchmark] add more iteration and use median for robust latency benchmark (#3889) | 2024-04-06 21:32:30 +00:00
Isotr0py | 54951ac4bf | [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#3869) | 2024-04-05 12:02:09 -07:00
SangBin Cho | 18de883489 | [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) | 2024-04-05 10:17:58 -07:00
Thomas Parnell | 1d7c940d74 | Add option to completion API to truncate prompt tokens (#3144) | 2024-04-05 10:15:42 -07:00
Woosuk Kwon | cfaf49a167 | [Misc] Define common requirements (#3841) | 2024-04-05 00:39:17 -07:00
Noam Gat | 9edec652e2 | [Bugfix] Fixing requirements.txt (#3865) | 2024-04-04 23:46:01 -07:00
Cade Daniel | e0dd4d3589 | [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864) | 2024-04-04 21:57:33 -07:00
Cade Daniel | e5043a3e75 | [Misc] Add pytest marker to opt-out of global test cleanup (#3863) | 2024-04-04 21:54:16 -07:00
youkaichao | d03d64fd2e | [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels (#3859) | 2024-04-04 21:53:16 -07:00
Sean Gallen | 78107fa091 | [Doc]Add asynchronous engine arguments to documentation. (#3810) (Co-authored-by: Simon Mo <simon.mo@hey.com>, Roger Wang <136131678+ywang96@users.noreply.github.com>) | 2024-04-04 21:52:01 -07:00
youkaichao | c391e4b68e | [Core] improve robustness of pynccl (#3860) | 2024-04-04 16:52:12 -07:00
Saurabh Dash | 9117f892f0 | [Model] Cohere CommandR+ (#3829) | 2024-04-04 13:31:49 -07:00
Michael Goin | db2a6a41e2 | [Hardware][CPU] Update cpu torch to match default of 2.2.1 (#3854) | 2024-04-04 19:49:49 +00:00
youkaichao | ca81ff5196 | [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) | 2024-04-04 10:26:19 -07:00
TianYu GUO | b7782002e1 | [Benchmark] Refactor sample_requests in benchmark_throughput (#3613) (Co-authored-by: Roger Wang <ywang@roblox.com>) | 2024-04-04 09:56:22 +00:00
Chang Su | 819a309c0f | [Bugfix] Fix args in benchmark_serving (#3836) (Co-authored-by: Roger Wang <ywang@roblox.com>) | 2024-04-04 07:41:05 +00:00
Matthias Gerstgrasser | aabe8f40f2 | [Core] [Frontend] Make detokenization optional (#3749) (Co-authored-by: Nick Hill <nickhill@us.ibm.com>) | 2024-04-03 21:52:18 -07:00
Woosuk Kwon | 498eb5cfa3 | [Bugfix] Add kv_scale input parameter to CPU backend (#3840) | 2024-04-04 04:33:08 +00:00
Michael Feil | 537ee25f43 | [Core] Enable hf_transfer by default if available (#3817) | 2024-04-04 04:02:43 +00:00
Tao He | 294f8f6665 | [BugFix] Pass tokenizer_config to local_tokenizer_group (#3754) (Signed-off-by: Tao He <sighingnow@gmail.com>) | 2024-04-03 20:31:46 -07:00
Woosuk Kwon | b95047f2da | [Misc] Publish 3rd meetup slides (#3835) | 2024-04-03 15:46:10 -07:00
Adrian Abeyta | 2ff767b513 | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) (Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>, HaiShaw <hixiao@gmail.com>, AdrianAbeyta <Adrian.Abeyta@amd.com>, Matthew Wong <Matthew.Wong2@amd.com>, root <root@gt-pla-u18-08.pla.dcgpu>, mawong-amd <156021403+mawong-amd@users.noreply.github.com>, ttbachyinsda <ttbachyinsda@outlook.com>, guofangze <guofangze@kuaishou.com>, Michael Goin <mgoin64@gmail.com>, jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>, Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2024-04-03 14:15:55 -07:00
SangBin Cho | 3dcb3e8b98 | [3/N] Refactor scheduler for chunked prefill scheduling (#3550) | 2024-04-03 14:13:49 -07:00
Michael Feil | c64cf38673 | [Doc] Update contribution guidelines for better onboarding (#3819) | 2024-04-03 07:31:43 +00:00
Robert Shaw | 76b889bf1d | [Doc] Update README.md (#3806) | 2024-04-02 23:11:10 -07:00
Nick Hill | c9b506dad4 | [BugFix] Use different mechanism to get vllm version in is_cpu() (#3804) | 2024-04-02 23:06:25 -07:00
Cade Daniel | 5757d90e26 | [Speculative decoding] Adding configuration object for speculative decoding (#3706) (Co-authored-by: Lily Liu <lilyliupku@gmail.com>) | 2024-04-03 00:40:57 +00:00
youkaichao | a3c226e7eb | [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary (#3803) | 2024-04-02 12:57:04 -07:00
Michael Goin | b321d4881b | [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ (#3798) | 2024-04-02 12:35:31 -07:00
leiwen83 | ad6eca408b | Fix early CUDA init via get_architecture_class_name import (#3770) (Signed-off-by: Lei Wen <wenlei03@qiyi.com>; Co-authored-by: Lei Wen <wenlei03@qiyi.com>) | 2024-04-02 11:56:26 -07:00
youkaichao | 205b94942e | [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build (#3801) | 2024-04-02 11:54:33 -07:00
Roger Wang | 3bec41f41a | [Doc] Fix vLLMEngine Doc Page (#3791) | 2024-04-02 09:49:37 -07:00
A-Mahla | 0739b1947f | [Frontend][Bugfix] allow using the default middleware with a root path (#3788) (Co-authored-by: A-Mahla <>) | 2024-04-02 01:20:28 -07:00
bigPYJ1151 | 77a6572aa5 | [HotFix] [CI/Build] Minor fix for CPU backend CI (#3787) | 2024-04-01 22:50:53 -07:00
bigPYJ1151 | 0e3f06fe9c | [Hardware][Intel] Add CPU inference backend (#3634) (Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>, Yuan Zhou <yuan.zhou@intel.com>) | 2024-04-01 22:07:30 -07:00