squall/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Jiaxin Shan	42c7f66a38	[Core] Support dynamically loading Lora adapter from HuggingFace (#6234) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-07-22 15:42:40 -07:00
Swapnil Parekh	4d6ada947c	[CORE] Adding support for insertion of soft-tuned prompts (#4645) Co-authored-by: Swapnil Parekh <swapnilp@ibm.com> Co-authored-by: Joe G <joseph.granados@h2o.ai> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-07-09 13:26:36 -07:00
youkaichao	482045ee77	[hardware][misc] introduce platform abstraction (#6080)	2024-07-02 20:12:22 -07:00
Qubitium-ModelCloud	ee93f4f92a	[CORE] Quantized lm-head Framework (#4442) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com> Co-authored-by: ZX <zx@lbx.dev>	2024-07-02 22:25:17 +00:00
youkaichao	614aa51203	[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007)	2024-06-30 20:07:34 -07:00
SangBin Cho	f5e73c9f1b	[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909) Co-authored-by: sang <sangcho@anyscale.com>	2024-06-30 17:11:15 +00:00
Woosuk Kwon	79c92c7c8a	[Model] Add Gemma 2 (#5908)	2024-06-27 13:33:56 -07:00
Cyrus Leung	96354d6a29	[Model] Add base class for LoRA-supported models (#5018)	2024-06-27 16:03:04 +08:00
rohithkrn	f5dda63eb5	[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)	2024-06-21 15:42:46 -07:00
Jee Li	67005a07bc	[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-21 04:46:28 +00:00
Cyrus Leung	0e9164b40a	[mypy] Enable type checking for test directory (#5017)	2024-06-15 04:45:31 +00:00
Cyrus Leung	0bfa1c4f13	[Misc] Improve error message when LoRA parsing fails (#5194)	2024-06-10 19:38:49 +08:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)	2024-06-09 16:23:30 -04:00
Antoni Baum	ccdc490dda	[Core] Change LoRA embedding sharding to support loading methods (#5038)	2024-06-06 19:07:57 -07:00
Zhuohan Li	8279078e21	[Bugfix] Remove deprecated @abstractproperty (#5174)	2024-06-01 22:40:25 +00:00
raywanb	97b030005c	[Model] LoRA gptbigcode implementation (#3949)	2024-05-22 13:58:59 -07:00
SangBin Cho	c74c913bfb	[misc] remove comments that were supposed to be removed (#4977)	2024-05-22 09:02:58 -04:00
SangBin Cho	2e9a2227ec	[Lora] Support long context lora (#4787) Currently we need to call the rotary embedding kernel once for each LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through, replacing the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files	2024-05-18 16:05:23 +09:00
Antoni Baum	ad932a221d	[Core] Faster startup for LoRA enabled models (#4634)	2024-05-08 10:33:18 -07:00
Austin Veselka	10760da800	[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (#4609)	2024-05-07 10:59:07 -07:00
Austin Veselka	eefeb16464	[Kernel] Full Tensor Parallelism for LoRA Layers (#3524) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-04-27 00:03:48 -07:00
Cody Yu	a62aaf1df5	[Misc][Refactor] Generalize linear_method to be quant_method (#4373)	2024-04-26 16:41:14 -04:00
SangBin Cho	a88081bf76	[CI] Disable non-lazy string operation on logging (#4326) Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>	2024-04-26 00:16:58 -07:00
SangBin Cho	b5b4a398a7	[Mypy] Typing lora folder (#4337)	2024-04-25 19:13:50 +00:00
SangBin Cho	0ae11f78ab	[Mypy] Part 3 fix typing for nested directories for most of directory (#4161)	2024-04-22 21:32:44 -07:00
Jee Li	d17c8477f1	[Bugfix] Fix LoRA loading check (#4138) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-04-19 00:59:54 -07:00
SangBin Cho	533d2a1f39	[Typing] Mypy typing part 2 (#4043) Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>	2024-04-17 17:28:43 -07:00
Jee Li	b8aacac31a	[Bugfix] Fix LoRA bug (#4032)	2024-04-12 16:56:37 -07:00
Jee Li	1096717ae9	[Core] Support LoRA on quantized models (#4012)	2024-04-11 21:02:44 -07:00
Antoni Baum	1e96c3341a	Add extra punica sizes to support bigger vocabs (#4015)	2024-04-11 22:18:57 +00:00
Antoni Baum	a10d3056da	[Core] Set `linear_weights` directly on the layer (#3977)	2024-04-11 16:35:51 -04:00
SangBin Cho	67b4221a61	[Core][5/N] Fully working chunked prefill e2e (#3884)	2024-04-10 17:56:48 -07:00
youkaichao	63e7176f26	[Core][Refactor] move parallel_utils into vllm/distributed (#3950) [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)	2024-04-10 15:33:30 -07:00
Jee Li	11dd6ebb89	[Misc] Avoid loading incorrect LoRA config (#3777)	2024-04-09 19:47:15 -07:00
Nick Hill	991143cfcd	[BugFix] Use consistent logger everywhere (#3738)	2024-03-29 23:26:44 +00:00
Jee Li	8af890a865	Enable more models to inference based on LoRA (#3382) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-03-25 18:09:31 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495)	2024-03-25 07:59:47 -07:00
Zhuohan Li	e90fc21f2e	[Hardware][Neuron] Refactor neuron support (#3471)	2024-03-22 01:22:17 +00:00
Roy	f1c0fc3919	Migrate `logits` computation and gather to `model_runner` (#3233)	2024-03-20 23:25:01 +00:00
Nick Hill	4ad521d8b5	[Core] Add generic typing to `LRUCache` (#3511)	2024-03-20 00:36:09 -07:00
Zhuohan Li	2f8844ba08	Re-enable the 80 char line width limit (#3305)	2024-03-10 19:49:14 -07:00
Liangfu Chen	3b7178cfa4	[Neuron] Support inference with transformers-neuronx (#2569)	2024-02-28 09:34:34 -08:00
Massimiliano Pronesti	93dc5a2870	chore(vllm): codespell for spell checking (#2820)	2024-02-21 18:56:01 -08:00
Woosuk Kwon	d7afab6d3a	[BugFix] Fix GC bug for `LLM` class (#2882)	2024-02-14 22:17:44 -08:00
Terry	2a543d6efe	Add LoRA support for Mixtral (#2831) * add mixtral lora support * formatting * fix incorrectly ported logic * polish tests * minor fixes and refactoring * minor fixes * formatting * rename and remove redundant logic * refactoring * refactoring * minor fix * minor refactoring * fix code smell	2024-02-14 00:55:45 +01:00
Philipp Moritz	390b495ff3	Don't build punica kernels by default (#2605)	2024-01-26 15:19:19 -08:00
Antoni Baum	9b945daaf1	[Experimental] Add multi-LoRA support (#1804) Co-authored-by: Chen Shen <scv119@gmail.com> Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com> Co-authored-by: Avnish Narayan <avnish@anyscale.com>	2024-01-23 15:26:37 -08:00

47 Commits