squall/vllm - vllm - Gitea: Git with a cup of tea

squall/vllm

Commit Graph

Author	SHA1	Message	Date
Antoni Baum	ccdc490dda	[Core] Change LoRA embedding sharding to support loading methods (#5038)	2024-06-06 19:07:57 -07:00
Cyrus Leung	5ae5ed1e60	[Core] Consolidate prompt arguments to LLM engines (#4328) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-05-28 13:29:31 -07:00
SangBin Cho	2e9a2227ec	[Lora] Support long context lora (#4787) Currently we need to call rotary embedding kernel for each LoRA, which makes it hard to serve multiple long context length LoRA. Add batched rotary embedding kernel and pipe it through. It replaces the rotary embedding layer to the one that is aware of multiple cos-sin-cache per scaling factors. Follow up of https://github.com/vllm-project/vllm/pull/3095/files	2024-05-18 16:05:23 +09:00
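
The message for 2e9a2227ec describes a rotary embedding layer that holds one cos-sin cache per scaling factor, so a single batched call can serve tokens that belong to different long-context LoRAs. The following is a minimal pure-Python sketch of that idea, not the code from the commit: the function names, the linear position scaling, the half-split rotation layout, and the per-token offset lookup are all assumptions made for illustration.

```python
import math

def build_cos_sin_cache(max_pos, rot_dim, base, scaling_factor):
    """Return (cos, sin) rows for one scaling factor (hypothetical helper).

    Assumes linear scaling: position p is treated as p / scaling_factor,
    and the cache is extended to max_pos * scaling_factor positions.
    """
    inv_freq = [1.0 / (base ** (2 * i / rot_dim)) for i in range(rot_dim // 2)]
    rows = []
    for pos in range(int(max_pos * scaling_factor)):
        t = pos / scaling_factor
        angles = [t * f for f in inv_freq]
        rows.append(([math.cos(a) for a in angles], [math.sin(a) for a in angles]))
    return rows

def build_batched_cache(max_pos, rot_dim, base, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    cache, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(cache)
        cache.extend(build_cos_sin_cache(max_pos, rot_dim, base, factor))
    return cache, offsets

def apply_rotary(vec, cos, sin):
    """Rotate the pairs (vec[i], vec[i + half]) by the cached angles."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second

def batched_rotary(vectors, positions, token_offsets, cache):
    """One pass over the batch: each token reads the cache row at
    (its own offset + its position), so tokens with different scaling
    factors can be processed together."""
    out = []
    for vec, pos, off in zip(vectors, positions, token_offsets):
        cos, sin = cache[off + pos]
        out.append(apply_rotary(vec, cos, sin))
    return out

# Two tokens in one batch: the first uses factor 1.0, the second factor 2.0.
cache, offsets = build_batched_cache(max_pos=4, rot_dim=4, base=10000.0,
                                     scaling_factors=[1.0, 2.0])
rotated = batched_rotary(
    vectors=[[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]],
    positions=[1, 2],
    token_offsets=[offsets[1.0], offsets[2.0]],
    cache=cache,
)
```

In this sketch, position 2 under scaling factor 2.0 maps to the same angles as position 1 under factor 1.0, so both tokens in the usage example are rotated identically. The point of the single concatenated table is that per-LoRA differences reduce to an integer offset per token rather than a separate call per LoRA.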

3 Commits