Woosuk Kwon
|
5d5b4c5fe5
|
[Bugfix][TPU] Add missing None to model input (#6245)
|
2024-07-09 00:21:37 -07:00 |
|
youkaichao
|
70c232f85a
|
[core][distributed] fix ray worker rank assignment (#6235)
|
2024-07-08 21:31:44 -07:00 |
|
youkaichao
|
a3c9435d93
|
[hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability (#6216)
|
2024-07-08 20:02:15 -07:00 |
|
Simon Mo
|
4f0e0ea131
|
Add FlashInfer to default Dockerfile (#6172)
|
2024-07-08 13:38:03 -07:00 |
|
tomeras91
|
ddc369fba1
|
[Bugfix] Mamba cache Cuda Graph padding (#6214)
|
2024-07-08 11:25:51 -07:00 |
|
Eric
|
185ad31f37
|
[Bugfix] use diskcache in outlines _get_guide #5436 (#6203)
|
2024-07-08 11:23:24 -07:00 |
|
afeldman-nm
|
543aa48573
|
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-07-08 17:12:15 +00:00 |
|
Avshalom Manevich
|
f7a8fa39d8
|
[Kernel] reloading fused_moe config on the last chunk (#6210)
|
2024-07-08 08:00:38 -07:00 |
|
Haichuan
|
717f4bcea0
|
Feature/add benchmark testing (#5947)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-07-08 07:52:06 +00:00 |
|
kczimm
|
16620f439d
|
do not exclude object field in CompletionStreamResponse (#6196)
|
2024-07-08 10:32:57 +08:00 |
|
youkaichao
|
3b08fe2b13
|
[misc][frontend] log all available endpoints (#6195)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-07-07 15:11:12 -07:00 |
|
Robert Shaw
|
abfe705a02
|
[ Misc ] Support Fp8 via llm-compressor (#6110)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
|
2024-07-07 20:42:11 +00:00 |
|
Haichuan
|
333306a252
|
add benchmark for fix length input and output (#5857)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-07-07 07:42:13 +00:00 |
|
Roger Wang
|
6206dcb29e
|
[Model] Add PaliGemma (#5189)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-07-07 09:25:50 +08:00 |
|
Cyrus Leung
|
9389380015
|
[Doc] Move guide for multimodal model and other improvements (#6168)
|
2024-07-06 17:18:59 +08:00 |
|
Roger Wang
|
175c43eca4
|
[Doc] Reorganize Supported Models by Type (#6167)
|
2024-07-06 05:59:36 +00:00 |
|
Simon Mo
|
bc96d5c330
|
Move release wheel env var to Dockerfile instead (#6163)
|
2024-07-05 17:19:53 -07:00 |
|
Simon Mo
|
f0250620dd
|
Fix release wheel build env var (#6162)
|
2024-07-05 16:24:31 -07:00 |
|
Simon Mo
|
2de490d60f
|
Update wheel builds to strip debug (#6161)
|
2024-07-05 14:51:25 -07:00 |
|
Simon Mo
|
79d406e918
|
[Docs] Fix readthedocs for tag build (#6158)
|
2024-07-05 12:44:40 -07:00 |
|
Simon Mo
|
abad5746a7
|
bump version to v0.5.1 (#6157)
|
2024-07-05 12:04:51 -07:00 |
|
JGSweets
|
e58294ddf2
|
[Bugfix] Add verbose error if scipy is missing for blocksparse attention (#5695)
|
2024-07-05 10:41:01 -07:00 |
|
jvlunteren
|
f1e15da6fe
|
[Frontend] Continuous usage stats in OpenAI completion API (#5742)
|
2024-07-05 10:37:09 -07:00 |
|
Christian Rohmann
|
0097bb1829
|
[Bugfix] Use templated datasource in grafana.json to allow automatic imports (#6136)
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
|
2024-07-05 09:49:47 -07:00 |
|
Cyrus Leung
|
ea4b570483
|
[VLM] Cleanup validation and update docs (#6149)
|
2024-07-05 05:49:38 +00:00 |
|
Roger Wang
|
a41357e941
|
[VLM] Improve consistency between feature size calculation and dummy data for profiling (#6146)
|
2024-07-05 09:29:47 +08:00 |
|
Cyrus Leung
|
ae96ef8fbd
|
[VLM] Calculate maximum number of multi-modal tokens by model (#6121)
|
2024-07-04 16:37:23 -07:00 |
|
Lily Liu
|
69ec3ca14c
|
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-07-04 16:35:51 -07:00 |
|
Yuan
|
81d7a50f24
|
[Hardware][Intel CPU] Adding intel openmp tunings in Docker file (#6008)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
|
2024-07-04 15:22:12 -07:00 |
|
youkaichao
|
27902d42be
|
[misc][doc] try to add warning for latest html (#5979)
|
2024-07-04 09:57:09 -07:00 |
|
Gregory Shtrasberg
|
56b325e977
|
[ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention (#6043)
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
|
2024-07-03 22:19:38 -07:00 |
|
Cyrus Leung
|
3dd507083f
|
[CI/Build] Cleanup VLM tests (#6107)
|
2024-07-03 18:58:18 -07:00 |
|
Murali Andoorveedu
|
0ed646b7aa
|
[Distributed][Core] Support Py39 and Py38 for PP (#6120)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
|
2024-07-03 17:52:29 -07:00 |
|
Travis Johnson
|
1dab9bc8a9
|
[Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing (#6109)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-07-03 16:56:59 -07:00 |
|
youkaichao
|
3de6e6a30e
|
[core][distributed] support n layers % pp size != 0 (#6115)
|
2024-07-03 16:40:31 -07:00 |
|
youkaichao
|
966fe72141
|
[doc][misc] bump up py version in installation doc (#6119)
|
2024-07-03 15:52:04 -07:00 |
|
Robert Shaw
|
62963d129e
|
[ Misc ] Clean Up CompressedTensorsW8A8 (#6113)
|
2024-07-03 22:50:08 +00:00 |
|
xwjiang2010
|
d9e98f42e4
|
[vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-07-03 22:14:16 +00:00 |
|
youkaichao
|
3c6325f0fc
|
[core][distributed] custom allreduce when pp size > 1 (#6117)
|
2024-07-03 14:41:32 -07:00 |
|
Michael Goin
|
47f0954af0
|
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975)
|
2024-07-03 17:38:00 +00:00 |
|
Roger Wang
|
7cd2ebb025
|
[Bugfix] Fix compute_logits in Jamba (#6093)
|
2024-07-03 00:32:35 -07:00 |
|
Roger Wang
|
f1c78138aa
|
[Doc] Fix Mock Import (#6094)
|
2024-07-03 00:13:56 -07:00 |
|
Roger Wang
|
3a86b54fb0
|
[VLM][Frontend] Proper Image Prompt Formatting from OpenAI API (#6091)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-07-02 23:41:23 -07:00 |
|
youkaichao
|
f666207161
|
[misc][distributed] error on invalid state (#6092)
|
2024-07-02 23:37:29 -07:00 |
|
Nick Hill
|
d830656a97
|
[BugFix] Avoid unnecessary Ray import warnings (#6079)
|
2024-07-03 14:09:40 +08:00 |
|
SangBin Cho
|
d18bab3587
|
[CI] Fix base url doesn't strip "/" (#6087)
|
2024-07-02 21:31:25 -07:00 |
|
Cyrus Leung
|
9831aec49f
|
[Core] Dynamic image size support for VLMs (#5276)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
|
2024-07-02 20:34:00 -07:00 |
|
youkaichao
|
482045ee77
|
[hardware][misc] introduce platform abstraction (#6080)
|
2024-07-02 20:12:22 -07:00 |
|
Mor Zusman
|
9d6a8daa87
|
[Model] Jamba support (#4115)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
|
2024-07-02 23:11:29 +00:00 |
|
Qubitium-ModelCloud
|
ee93f4f92a
|
[CORE] Quantized lm-head Framework (#4442)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
|
2024-07-02 22:25:17 +00:00 |
|