Matt Wong
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes ( #5422 )
2024-06-25 15:56:15 -07:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
Michael Goin
d9b34baedd
[CI/Build] Add unit testing for FlexibleArgumentParser ( #5798 )
2024-06-25 12:18:03 -07:00
Antoni Baum
67882dbb44
[Core] Add fault tolerance for RayTokenizerGroupPool ( #5748 )
2024-06-25 10:15:10 -07:00
Woo-Yeon Lee
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model ( #5414 )
2024-06-25 09:56:06 +00:00
Isotr0py
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement ( #5772 )
2024-06-24 12:11:53 +08:00
Murali Andoorveedu
5d4d90536f
[Distributed] Add send and recv helpers ( #5719 )
2024-06-23 14:42:28 -07:00
rohithkrn
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache ( #5603 )
2024-06-21 15:42:46 -07:00
youkaichao
d9a252bc8e
[Core][Distributed] add shm broadcast ( #5399 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
Jee Li
67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora ( #5665 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Chang Su
c35e4a3dd7
[BugFix] Fix test_phi3v.py ( #5725 )
2024-06-21 04:45:34 +00:00
Jinzhen Lin
1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA ( #5441 )
2024-06-20 17:55:41 -07:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Michael Goin
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names ( #5718 )
2024-06-20 17:00:13 -06:00
Cyrus Leung
3730a1c832
[Misc] Improve conftest ( #5681 )
2024-06-19 19:09:21 -07:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
zifeitong
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run ( #5654 )
2024-06-19 13:57:12 -07:00
youkaichao
d571ca0108
[ci][distributed] add tests for custom allreduce ( #5689 )
2024-06-19 20:16:04 +00:00
Thomas Parnell
e5150f2c28
[Bugfix] Added test for sampling repetition penalty bug. ( #5659 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-19 06:03:55 +00:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
Ronen Schaffer
7879f24dcc
[Misc] Add OpenTelemetry support ( #4687 )
...
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.
I've also added a markdown with print-screens to guide users how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
Roger Wang
4ad7b53e59
[CI/Build][Misc] Update Pytest Marker for VLMs ( #5623 )
2024-06-18 13:10:04 +00:00
Joe Runde
5002175e80
[Kernel] Add punica dimensions for Granite 13b ( #5559 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Isotr0py
daef218b55
[Model] Initialize Phi-3-vision support ( #4986 )
2024-06-17 19:34:33 -07:00
sroy745
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier ( #5131 )
2024-06-17 21:29:09 -05:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
Michael Goin
4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition ( #5577 )
2024-06-16 14:07:34 +00:00
Alexander Matveev
d919ecc771
add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 ( #5145 )
2024-06-15 13:38:16 -04:00
Cyrus Leung
81fbb3655f
[CI/Build] Test both text and token IDs in batched OpenAI Completions API ( #5568 )
2024-06-15 07:29:42 -04:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
leiwen83
1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 ( #5364 )
...
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-06-14 17:23:56 -07:00
youkaichao
48f589e18b
[mis] fix flaky test of test_cuda_device_count_stateless ( #5546 )
2024-06-14 10:02:23 -07:00
Antoni Baum
50eed24d25
Add cuda_device_count_stateless ( #5473 )
2024-06-13 16:06:49 -07:00
Tyler Michael Smith
33e3b37242
[CI/Build] Disable test_fp8.py ( #5508 )
2024-06-13 13:37:48 -07:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cyrus Leung
39873476f8
[CI/Build] Simplify OpenAI server setup in tests ( #5100 )
2024-06-13 11:21:53 -07:00
Michael Goin
23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations ( #5466 )
2024-06-13 15:18:08 +00:00
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
youkaichao
ea3890a5f0
[Core][Distributed] code deduplication in tp&pp with coordinator( #5293 )
...
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293 )
2024-06-12 17:27:08 -07:00
Travis Johnson
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146 , this PR improves FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large).
In details, we applied 3 optimizations:
- Use inverted scale so that most divisions are changed to multiplications.
- Unroll the loop by 4 times to improve ILP.
- Use vectorized 4 to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
SangBin Cho
847cdcca1c
[CI] Upgrade codespell version. ( #5381 )
2024-06-12 10:06:14 -07:00
Simon Mo
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
Michael Goin
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case ( #5230 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
youkaichao
c4bd03c7c5
[Core][Distributed] add same-node detection ( #5369 )
2024-06-11 10:53:59 -07:00
sasha0552
dcbf4286af
[Frontend] Customizable RoPE theta ( #5197 )
2024-06-11 10:42:26 -07:00
Cyrus Leung
640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" ( #5026 )
2024-06-10 22:36:46 -07:00
maor-ps
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs ( #5312 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-11 10:30:31 +08:00