2383 Commits

Author SHA1 Message Date
Kuntai Du
3d8a5f063d
[CI] Organizing performance benchmark files (#7616) 2024-08-19 22:43:54 -07:00
Zijian Hu
f4fc7337bf
[Bugfix] support tie_word_embeddings for all models (#5724) 2024-08-19 20:00:04 -07:00
Kevin H. Luu
0df7ec0b2d
[ci] Install Buildkite test suite analysis (#7667)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-19 19:55:04 -07:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) 2024-08-19 17:58:14 -07:00
youkaichao
e54ebc2f8f
[doc] fix doc build error caused by msgspec (#7659) 2024-08-19 17:50:59 -07:00
Travis Johnson
67e02fa8a4
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding (#7665)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-20 00:43:09 +00:00
Woosuk Kwon
43735bf5e1
[TPU] Remove redundant input tensor cloning (#7660) 2024-08-19 15:55:04 -07:00
Andrew Song
da115230fd
[Bugfix] Don't disable existing loggers (#7664) 2024-08-19 15:11:58 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization (#7520) 2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling (#7000)
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Ali Panahi
dad961ef5c
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 (#5428) 2024-08-19 20:47:00 +00:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics (#7606) 2024-08-19 11:52:07 -07:00
Woosuk Kwon
df845b2b46
[Misc] Remove Gemma RoPE (#7638) 2024-08-19 09:29:31 -07:00
Kunshang Ji
1a36287b89
[Bugfix] Fix xpu build (#7644) 2024-08-18 22:00:09 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available (#7137)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-19 03:24:03 +00:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109) 2024-08-18 17:57:20 -07:00
Woosuk Kwon
200a2ffa6b
[Misc] Refactor Llama3 RoPE initialization (#7637) 2024-08-18 17:18:12 -07:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models (#7475)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-08-19 07:43:46 +08:00
Robert Shaw
e3b318216d
[Bugfix] Fix Prometheus Metrics With zeromq Frontend (#7279)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Woosuk Kwon
ab7165f2c7
[TPU] Optimize RoPE forward_native2 (#7636) 2024-08-18 01:15:10 -07:00
Woosuk Kwon
0c2fa50b84
[TPU] Use mark_dynamic only for dummy run (#7634) 2024-08-18 00:18:53 -07:00
Woosuk Kwon
ce143353c6
[TPU] Skip creating empty tensor (#7630) 2024-08-17 14:22:46 -07:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling (#7530) 2024-08-17 13:30:55 -07:00
Jee Jee Li
1ef13cf92f
[Misc] Fix BitAndBytes exception messages (#7626) 2024-08-17 12:02:14 -07:00
youkaichao
832163b875
[ci][test] allow longer wait time for api server (#7629) 2024-08-17 11:26:38 -07:00
Besher Alkurdi
e73f76eec6
[Model] Pipeline parallel support for JAIS (#7603) 2024-08-17 11:11:09 -07:00
youkaichao
d95cc0a55c
[core][misc] update libcudart finding (#7620)
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com>
2024-08-16 23:01:35 -07:00
youkaichao
5bf45db7df
[ci][test] fix engine/logger test (#7621) 2024-08-16 23:00:59 -07:00
youkaichao
eed020f673
[misc] use nvml to get consistent device name (#7582) 2024-08-16 21:15:13 -07:00
Xander Johnson
7c0b7ea214
[Bugfix] add >= 1.0 constraint for openai dependency (#7612) 2024-08-16 20:56:01 -07:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests (#7600) 2024-08-16 20:49:30 -07:00
Rui Qiao
bae888cb8e
[Bugfix] Clear engine reference in AsyncEngineRPCServer (#7618)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-16 20:44:05 -07:00
Alexei-V-Ivanov-AMD
6bd19551b0
[Build/CI] Enabling passing AMD tests. (#7610) 2024-08-16 20:25:32 -07:00
bnellnm
e680349994
[Bugfix] Fix custom_ar support check (#7617) 2024-08-16 19:05:49 -07:00
Michael Goin
44f26a9466
[Model] Align nemotron config with final HF state and fix lm-eval-small (#7611) 2024-08-16 15:56:34 -07:00
bnellnm
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596) 2024-08-16 14:00:11 -07:00
bnellnm
7759ae958f
[Kernel][Misc] dynamo support for ScalarType (#7594) 2024-08-16 13:59:49 -07:00
bnellnm
9f69856356
[Kernel] register punica functions as torch ops (#7591) 2024-08-16 13:59:38 -07:00
Michael Goin
d4f0f17b02
[Doc] Update quantization supported hardware table (#7595) 2024-08-16 13:59:27 -07:00
Michael Goin
b3f4e17935
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints (#7444) 2024-08-16 13:59:16 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 (#7440)
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
2024-08-16 13:46:01 -07:00
William Lin
f366f6339b
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-16 11:41:56 -07:00
Michael Goin
855866caa9
[Kernel] Add tuned triton configs for ExpertsInt8 (#7601) 2024-08-16 11:37:01 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) 2024-08-16 10:06:30 -07:00
fzyzcjy
ec724a725e
support tqdm in notebooks (#7510) 2024-08-16 09:17:50 -07:00
Gordon Wong
0e39a33c6d
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method (#7513) 2024-08-16 10:05:18 -06:00
Kuntai Du
6fc5b0f249
[CI] Fix crashes of performance benchmark (#7500) 2024-08-16 08:08:45 -07:00
Nick Hill
9587b050fb
[Core] Use uvloop with zmq-decoupled front-end (#7570) 2024-08-15 22:48:07 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00