James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Simon Mo
62b5166bd4
[CI] Add ccache for wheel builds job ( #4281 )
2024-04-23 09:51:41 -07:00
youkaichao
d86285a4a4
[Core][Logging] Add last frame information for better debugging ( #4278 )
2024-04-23 09:45:52 -07:00
DefTruth
d87f39e9a9
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper ( #4286 )
2024-04-23 09:28:35 -07:00
Jack Gordley
d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server ( #4016 )
2024-04-23 19:06:29 +08:00
Cade Daniel
62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. ( #3951 )
2024-04-23 08:02:36 +00:00
SangBin Cho
050f285ff6
[Core] Scheduling optimization 2 ( #4280 )
2024-04-23 08:02:11 +00:00
Nick Hill
8f2ea22bde
[Core] Some simplification of WorkerWrapper changes ( #4183 )
2024-04-23 07:49:08 +00:00
SangBin Cho
0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory ( #4161 )
2024-04-22 21:32:44 -07:00
Harry Mellor
34128a697e
Fix autodoc directives ( #4272 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-23 01:53:01 +00:00
youkaichao
c1b4e4157c
[Core][Distributed] use absolute path for library file ( #4271 )
2024-04-22 17:21:48 -07:00
Zhanghao Wu
ceaf4ed003
[Doc] Update the SkyPilot doc with serving and Llama-3 ( #4276 )
2024-04-22 15:34:31 -07:00
SangBin Cho
ad8d696a99
[Core] Scheduler perf fix ( #4270 )
2024-04-22 21:11:06 +00:00
Harry Mellor
3d925165f2
Add example scripts to documentation ( #4225 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-22 16:36:54 +00:00
alexm-nm
1543680691
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter ( #4217 )
2024-04-22 09:10:48 -07:00
Tao He
077f0a2e8a
[Frontend] Enable support for CPU backend in AsyncLLMEngine. ( #3993 )
...
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-04-22 09:19:51 +00:00
Woosuk Kwon
e73ed0f1c6
[Bugfix] Fix type annotations in CPU model runner ( #4256 )
2024-04-22 00:54:16 -07:00
Isotr0py
296cdf8ac7
[Misc] Add vision language model support to CPU backend ( #3968 )
2024-04-22 00:44:16 -07:00
youkaichao
747b1a7147
[Core][Distributed] fix _is_full_nvlink detection ( #4233 )
2024-04-21 23:04:16 -07:00
Hongxia Yang
95e5b087cf
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring ( #4129 )
2024-04-21 21:57:24 -07:00
GeauxEric
a37d815b83
Make initialization of tokenizer and detokenizer optional ( #3748 )
...
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
xiaoji
7f2593b164
[Doc]: Update the doc of adding new models ( #4236 )
2024-04-21 09:57:08 -07:00
Harry Mellor
fe7d648fe5
Don't show default value for flags in EngineArgs ( #4223 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-21 09:15:28 -07:00
Noam Gat
cc74b2b232
Updating lm-format-enforcer version and adding links to decoding libraries in docs ( #4222 )
2024-04-20 08:33:16 +00:00
nunjunj
91528575ec
[Frontend] multiple sampling params support ( #3570 )
2024-04-20 00:11:57 -07:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
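The scaling scheme described above can be illustrated with a minimal pure-Python sketch. It is not the code from this PR and does not reproduce Fp8LinearMethod. The helper names (round_to_e4m3, per_tensor_scale, quantize, dequantize) are hypothetical, and the sketch assumes the E4M3 variant of FP8 (largest finite value 448) with saturating round-to-nearest; the PR text does not state which FP8 format or rounding mode is used.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed format)


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3-representable value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), FP8_E4M3_MAX)
    _, e = math.frexp(a)  # a = m * 2**e with 0.5 <= m < 1
    # 3 mantissa bits give a grid spacing of 2**(e - 4) for normal values;
    # below the smallest normal exponent the spacing stays fixed at 2**-9.
    step = 2.0 ** (max(e, -5) - 4)
    q = min(round(a / step) * step, FP8_E4M3_MAX)
    return math.copysign(q, v)


def per_tensor_scale(values):
    """One scale for the whole tensor: map the largest magnitude to the FP8 max."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def quantize(values, scale):
    """Divide by the scale, then snap each element to the FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(qvalues, scale):
    """Recover approximate original values from quantized ones."""
    return [q * scale for q in qvalues]


# Weights: the scale is computed once after loading and then stored.
weights = [0.5, -2.0, 1.0, 0.25]
w_scale = per_tensor_scale(weights)
w_fp8 = quantize(weights, w_scale)

# Activations: the scale is recomputed on every forward pass (dynamic scaling).
activations = [0.1, 3.7, -1.2]
a_scale = per_tensor_scale(activations)
a_fp8 = quantize(activations, a_scale)
```

In this sketch the weight scale is a stored constant, while the activation scale requires a full reduction over the tensor on each call. That extra per-pass work is one plausible cost of the dynamic scheme, though the PR text does not attribute the measured timings to it.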
Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
Harry Mellor
682789d402
Fix missing docs and out of sync EngineArgs ( #4219 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-19 20:51:33 -07:00
Ayush Rautwar
138485a82d
[Bugfix] Add fix for JSON whitespace ( #4189 )
...
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Chirag Jain
bc9df1571b
Pass tokenizer_revision when getting tokenizer in openai serving ( #4214 )
2024-04-19 17:13:56 -07:00
youkaichao
15b86408a8
[Misc] add nccl in collect env ( #4211 )
2024-04-19 19:44:51 +00:00
Ronen Schaffer
7be4f5628f
[Bugfix][Core] Restore logging of stats in the async engine ( #4150 )
2024-04-19 08:08:26 -07:00
Uranus
8f20fc04bf
[Misc] fix docstrings ( #4191 )
...
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com>
2024-04-19 08:18:33 +00:00
Simon Mo
221d93ecbf
Bump version of 0.4.1 ( #4177 )
2024-04-19 01:00:22 -07:00
Jee Li
d17c8477f1
[Bugfix] Fix LoRA loading check ( #4138 )
...
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
Simon Mo
a134ef6f5e
Support eos_token_id from generation_config.json ( #4182 )
2024-04-19 04:13:36 +00:00
youkaichao
8a7a3e4436
[Core] add an option to log every function call for debugging hang/crash in distributed inference ( #4079 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
Adam Tilghman
8f9c28fd40
[Bugfix] Fix CustomAllreduce nvlink topology detection ( #3974 )
...
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974 ) (#4159 )
2024-04-18 15:32:47 -07:00
Liangfu Chen
cd2f63fb36
[CI/CD] add neuron docker and ci test scripts ( #3571 )
2024-04-18 15:26:01 -07:00
Nick Hill
87fa80c91f
[Misc] Bump transformers to latest version ( #4176 )
2024-04-18 14:36:39 -07:00
James Whedbee
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields ( #4149 )
2024-04-18 21:12:55 +00:00
Simon Mo
705578ae14
[Docs] document that Meta Llama 3 is supported ( #4175 )
2024-04-18 10:55:48 -07:00
Michał Moskal
e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill ( #4128 )
2024-04-18 00:51:28 -07:00
Michael Goin
53b018edcb
[Bugfix] Get available quantization methods from quantization registry ( #4098 )
2024-04-18 00:21:55 -07:00
Harry Mellor
66ded03067
Allow model to be served under multiple names ( #2894 )
...
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
2024-04-18 00:16:26 -07:00
youkaichao
6dc1fc9cfe
[Core] nccl integrity check and test ( #4155 )
...
[Core] Add integrity check during initialization; add test for it (#4155 )
2024-04-17 22:28:52 -07:00
SangBin Cho
533d2a1f39
[Typing] Mypy typing part 2 ( #4043 )
...
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
2024-04-17 17:28:43 -07:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
Elinx
fe3b5bbc23
[Bugfix] fix output parsing error for trtllm backend ( #4137 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-17 11:07:23 +00:00
youkaichao
8438e0569e
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication ( #4024 )
...
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024 )
2024-04-17 08:34:33 +00:00
Cade Daniel
11d652bd4f
[CI] Move CPU/AMD tests to after wait ( #4123 )
2024-04-16 22:53:26 -07:00