squall/vllm - vllm

Author	SHA1	Message	Date
James Fleming	2b7949c1c2	AQLM CUDA support (#3287) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-23 13:59:33 -04:00
Simon Mo	62b5166bd4	[CI] Add ccache for wheel builds job (#4281)	2024-04-23 09:51:41 -07:00
youkaichao	d86285a4a4	[Core][Logging] Add last frame information for better debugging (#4278)	2024-04-23 09:45:52 -07:00
DefTruth	d87f39e9a9	[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286)	2024-04-23 09:28:35 -07:00
Jack Gordley	d3c8180ac4	[Bugfix] Fixing max token error message for openai compatible server (#4016)	2024-04-23 19:06:29 +08:00
Cade Daniel	62b8aebc6f	[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951)	2024-04-23 08:02:36 +00:00
SangBin Cho	050f285ff6	[Core] Scheduling optimization 2 (#4280)	2024-04-23 08:02:11 +00:00
Nick Hill	8f2ea22bde	[Core] Some simplification of WorkerWrapper changes (#4183)	2024-04-23 07:49:08 +00:00
SangBin Cho	0ae11f78ab	[Mypy] Part 3 fix typing for nested directories for most of directory (#4161)	2024-04-22 21:32:44 -07:00
Harry Mellor	34128a697e	Fix `autodoc` directives (#4272) Co-authored-by: Harry Mellor <hmellor@oxts.com>	2024-04-23 01:53:01 +00:00
youkaichao	c1b4e4157c	[Core][Distributed] use absolute path for library file (#4271)	2024-04-22 17:21:48 -07:00
Zhanghao Wu	ceaf4ed003	[Doc] Update the SkyPilot doc with serving and Llama-3 (#4276)	2024-04-22 15:34:31 -07:00
SangBin Cho	ad8d696a99	[Core] Scheduler perf fix (#4270)	2024-04-22 21:11:06 +00:00
Harry Mellor	3d925165f2	Add example scripts to documentation (#4225) Co-authored-by: Harry Mellor <hmellor@oxts.com>	2024-04-22 16:36:54 +00:00
alexm-nm	1543680691	[Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217)	2024-04-22 09:10:48 -07:00
Tao He	077f0a2e8a	[Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993) Signed-off-by: Tao He <sighingnow@gmail.com>	2024-04-22 09:19:51 +00:00
Woosuk Kwon	e73ed0f1c6	[Bugfix] Fix type annotations in CPU model runner (#4256)	2024-04-22 00:54:16 -07:00
Isotr0py	296cdf8ac7	[Misc] Add vision language model support to CPU backend (#3968)	2024-04-22 00:44:16 -07:00
youkaichao	747b1a7147	[Core][Distributed] fix _is_full_nvlink detection (#4233)	2024-04-21 23:04:16 -07:00
Hongxia Yang	95e5b087cf	[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129)	2024-04-21 21:57:24 -07:00
GeauxEric	a37d815b83	Make initialization of tokenizer and detokenizer optional (#3748) Co-authored-by: Yun Ding <yunding@nvidia.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-04-21 22:06:46 +00:00
xiaoji	7f2593b164	[Doc]: Update the doc of adding new models (#4236)	2024-04-21 09:57:08 -07:00
Harry Mellor	fe7d648fe5	Don't show default value for flags in `EngineArgs` (#4223) Co-authored-by: Harry Mellor <hmellor@oxts.com>	2024-04-21 09:15:28 -07:00
Noam Gat	cc74b2b232	Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222)	2024-04-20 08:33:16 +00:00
nunjunj	91528575ec	[Frontend] multiple sampling params support (#3570)	2024-04-20 00:11:57 -07:00
Cody Yu	a22cdea371	[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726. This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine. Algorithm: We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes the weights accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass. Initial results: Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128: BF16: 1.47s; FP8: 1.66s. I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.	2024-04-20 04:28:57 +00:00
Harry Mellor	682789d402	Fix missing docs and out of sync `EngineArgs` (#4219) Co-authored-by: Harry Mellor <hmellor@oxts.com>	2024-04-19 20:51:33 -07:00
Ayush Rautwar	138485a82d	[Bugfix] Add fix for JSON whitespace (#4189) Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>	2024-04-19 20:49:22 -07:00
Chirag Jain	bc9df1571b	Pass `tokenizer_revision` when getting tokenizer in openai serving (#4214)	2024-04-19 17:13:56 -07:00
youkaichao	15b86408a8	[Misc] add nccl in collect env (#4211)	2024-04-19 19:44:51 +00:00
Ronen Schaffer	7be4f5628f	[Bugfix][Core] Restore logging of stats in the async engine (#4150)	2024-04-19 08:08:26 -07:00
Uranus	8f20fc04bf	[Misc] fix docstrings (#4191) Co-authored-by: Zhong Wang <wangzhong@infini-ai.com>	2024-04-19 08:18:33 +00:00
Simon Mo	221d93ecbf	Bump version of 0.4.1 (#4177)	2024-04-19 01:00:22 -07:00
Jee Li	d17c8477f1	[Bugfix] Fix LoRA loading check (#4138) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-04-19 00:59:54 -07:00
Simon Mo	a134ef6f5e	Support eos_token_id from generation_config.json (#4182)	2024-04-19 04:13:36 +00:00
youkaichao	8a7a3e4436	[Core] add an option to log every function call for debugging hang/crash in distributed inference (#4079) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-18 16:15:12 -07:00
Adam Tilghman	8f9c28fd40	[Bugfix] Fix CustomAllreduce nvlink topology detection (#3974) [Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) (#4159)	2024-04-18 15:32:47 -07:00
Liangfu Chen	cd2f63fb36	[CI/CD] add neuron docker and ci test scripts (#3571)	2024-04-18 15:26:01 -07:00
Nick Hill	87fa80c91f	[Misc] Bump transformers to latest version (#4176)	2024-04-18 14:36:39 -07:00
James Whedbee	e1bb2fd52d	[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149)	2024-04-18 21:12:55 +00:00
Simon Mo	705578ae14	[Docs] document that Meta Llama 3 is supported (#4175)	2024-04-18 10:55:48 -07:00
Michał Moskal	e8cc7967ff	[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128)	2024-04-18 00:51:28 -07:00
Michael Goin	53b018edcb	[Bugfix] Get available quantization methods from quantization registry (#4098)	2024-04-18 00:21:55 -07:00
Harry Mellor	66ded03067	Allow model to be served under multiple names (#2894) Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>	2024-04-18 00:16:26 -07:00
youkaichao	6dc1fc9cfe	[Core] nccl integrity check and test (#4155) [Core] Add integrity check during initialization; add test for it (#4155)	2024-04-17 22:28:52 -07:00
SangBin Cho	533d2a1f39	[Typing] Mypy typing part 2 (#4043) Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>	2024-04-17 17:28:43 -07:00
Shoichi Uchinami	a53222544c	[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134)	2024-04-17 10:02:45 -07:00
Elinx	fe3b5bbc23	[Bugfix] fix output parsing error for trtllm backend (#4137) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-04-17 11:07:23 +00:00
youkaichao	8438e0569e	[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024) [Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024)	2024-04-17 08:34:33 +00:00
Cade Daniel	11d652bd4f	[CI] Move CPU/AMD tests to after wait (#4123)	2024-04-16 22:53:26 -07:00
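
Note on commit a22cdea371 ([Kernel][FP8], #4118): the commit message describes per-tensor scaling, where one scale is derived from a tensor's largest absolute value, stored for the weights, and recomputed on every forward pass for activations. The following is a minimal pure-Python sketch of that idea only; it is not the code from the PR. The function names are hypothetical, the e4m3 format (maximum finite value 448) is an assumption since the message does not name the FP8 variant, and rounding is simulated in software rather than done by a GPU kernel.

```python
import math

# Assumption: FP8 e4m3 (4 exponent bits, 3 mantissa bits), max finite value 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Simulate rounding x to the nearest FP8 e4m3 value (saturating at +/-448)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    # frexp returns |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) == e - 1.
    # Exponents below -6 fall into the subnormal range, which shares one step size.
    exponent = max(math.frexp(abs(x))[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # spacing between neighbours with 3 mantissa bits
    return round(x / step) * step


def per_tensor_scale(values):
    """One scale for the whole tensor: maps the largest magnitude onto FP8 max."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then round each element to the simulated FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    """Recover approximate original values by multiplying the scale back in."""
    return [q * scale for q in quantized]


# Weights: the scale is computed once after loading and kept for later use.
weights = [0.5, -1.0, 0.25, 2.0]
w_scale = per_tensor_scale(weights)
w_fp8 = quantize(weights, w_scale)

# Activations: the scale is recomputed for each forward pass ("dynamic" scaling).
activations = [0.1, -3.2, 7.5]
a_scale = per_tensor_scale(activations)
a_fp8 = quantize(activations, a_scale)
```

Because the scale is tied to the largest magnitude in the tensor, a single outlier stretches the range and leaves fewer representable steps for the remaining values, which is the usual trade-off of per-tensor (as opposed to finer-grained) scaling.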

1166 Commits