Commit Graph

1090 Commits

Author SHA1 Message Date
Woosuk Kwon
395aa823ea
[Misc] Minor type annotation fix (#3716) 2024-03-28 21:12:24 -07:00
SangBin Cho
26422e477b
[Test] Make model tests run again and remove --forked from pytest (#3631)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
youkaichao
f342153b48
Revert "bump version to v0.4.0" (#3708) 2024-03-28 18:49:42 -07:00
Simon Mo
27a57cad52
bump version to v0.4.0 (#3705) 2024-03-28 18:26:51 -07:00
Yile (Michael) Gu
98a42e7078
[Benchmark] Change mii to use persistent deployment and support tensor parallel (#3628) 2024-03-28 17:33:52 -07:00
youkaichao
0267fef52a
[Core] fix del of communicator (#3702) 2024-03-29 00:24:58 +00:00
Simon Mo
4716a32dd4
fix logging msg for block manager (#3701) 2024-03-28 23:29:55 +00:00
Woosuk Kwon
c0935c96d3
[Bugfix] Set enable_prefix_caching=True in prefix caching example (#3703) 2024-03-28 16:26:30 -07:00
Woosuk Kwon
cb40b3ab6b
[Kernel] Add MoE Triton kernel configs for A100 40GB (#3700) 2024-03-28 15:26:24 -07:00
Roy
515386ef3c
[Core] Support multi-node inference(eager and cuda graph) (#3686) 2024-03-28 15:01:55 -07:00
Simon Mo
a4075cba4d
[CI] Add test case to run examples scripts (#3638) 2024-03-28 14:36:10 -07:00
Simon Mo
96aa014d1e
fix benchmark format reporting in buildkite (#3693) 2024-03-28 14:35:16 -07:00
Adam Boeglin
1715056fef
[Bugfix] Update neuron_executor.py to add optional vision_language_config (#3695) 2024-03-28 10:43:34 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Roger Wang
ce567a2926
[Kernel] DBRX Triton MoE kernel H100 (#3692) 2024-03-28 10:05:34 -07:00
wenyujin333
d6ea427f04
[Model] Add support for Qwen2MoeModel (#3346) 2024-03-28 15:19:59 +00:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability (#3492) 2024-03-27 23:59:28 -07:00
Woosuk Kwon
8267b06c30
[Kernel] Add Triton MoE kernel configs for DBRX on A100 (#3679) 2024-03-27 22:22:25 -07:00
youkaichao
3492859b68
[CI/Build] update default number of jobs and nvcc threads to avoid overloading the system (#3675) 2024-03-28 00:18:54 -04:00
hxer7963
098e1776ba
[Model] Add support for xverse (#3610)
Co-authored-by: willhe <hexin@xverse.cn>
Co-authored-by: root <root@localhost.localdomain>
2024-03-27 18:12:54 -07:00
Roy
10e6322283
[Model] Fix and clean commandr (#3671) 2024-03-28 00:20:00 +00:00
Woosuk Kwon
6d9aa00fc4
[Docs] Add Command-R to supported models (#3669) 2024-03-27 15:20:00 -07:00
zeppombal
1182607e18
Add support for Cohere's Command-R model (#3433)
Co-authored-by: José Maria Pombal <jose.pombal@unbabel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-03-27 14:19:32 -07:00
Roger Wang
45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
AmadeusChan
1956931436
[Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621) 2024-03-27 13:39:05 -07:00
Megha Agarwal
e24336b5a7
[Model] Add support for DBRX (#3660) 2024-03-27 13:01:46 -07:00
youkaichao
d18f4e73f3
[Bugfix] [Hotfix] fix nccl library name (#3661) 2024-03-27 17:23:54 +00:00
Woosuk Kwon
82c540bebf
[Bugfix] More faithful implementation of Gemma (#3653) 2024-03-27 09:37:18 -07:00
youkaichao
8f44facddd
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
Woosuk Kwon
e66b629c04
[Misc] Minor fix in KVCache type (#3652) 2024-03-26 23:14:06 -07:00
Jee Li
76879342a3
[Doc]add lora support (#3649) 2024-03-27 02:06:46 +00:00
Jee Li
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels (#3636) 2024-03-27 00:37:42 +00:00
Nick Hill
0dc72273b8
[BugFix] Fix ipv4 address parsing regression (#3645) 2024-03-26 14:39:44 -07:00
liiliiliil
a979d9771e
[Bugfix] Fix ipv6 address parsing bug (#3641) 2024-03-26 11:58:20 -07:00
Jee Li
8af890a865
Enable more models to inference based on LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses (#2976)
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
Antoni Baum
3a243095e5
Optimize _get_ranks in Sampler (#3623) 2024-03-25 16:03:02 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
Simon Mo
f408d05c52
hotfix isort on logprobs ranks pr (#3622) 2024-03-25 11:55:46 -07:00
Dylan Hawk
0b4997e05c
[Bugfix] API stream returning two stops (#3450)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson
c13ad1b7bd
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
Swapnil Parekh
819924e749
[Core] Adding token ranks along with logprobs (#3516)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
2024-03-25 10:13:10 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
TianYu GUO
e67c295b0c
[Bugfix] fix automatic prefix args and add log info (#3608) 2024-03-25 05:35:22 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 (#3462) 2024-03-25 04:39:33 +00:00
少年
b0dfa91dd7
[Model] Add starcoder2 awq support (#3569) 2024-03-24 21:07:36 -07:00
Woosuk Kwon
56a8652f33
[Bugfix] store lock file in tmp directory (#3578)" (#3599)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-03-24 20:06:50 -07:00
Kunshang Ji
6d93d35308
[BugFix] tensor.get_device() -> tensor.device (#3604) 2024-03-24 19:01:13 -07:00
youkaichao
837e185142
[CI/Build] fix flaky test (#3602) 2024-03-24 17:43:05 -07:00
youkaichao
42bc386129
[CI/Build] respect the common environment variable MAX_JOBS (#3600) 2024-03-24 17:04:00 -07:00