2388 Commits

Author SHA1 Message Date
youkaichao
9fadc7b7a0
[misc] add zmq in collect env (#7119) 2024-08-03 22:03:46 -07:00
Yihuan Bu
654bc5ca49
Support for guided decoding for offline LLM (#6878)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-04 03:12:09 +00:00
Jeff Fialho
825b044863
[Frontend] Warn if user max_model_len is greater than derived max_model_len (#7080)
Signed-off-by: Jefferson Fialho <jfialho@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-03 16:01:38 -07:00
youkaichao
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test (#7114) 2024-08-03 10:44:53 -07:00
Kuntai Du
67d745cc68
[CI] Temporarily turn off H100 performance benchmark (#7104) 2024-08-02 23:52:44 -07:00
Jee Jee Li
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA (#7081) 2024-08-02 22:40:19 -07:00
Zach Zheng
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits (#7018) 2024-08-02 22:38:15 -07:00
Isotr0py
0c25435daa
[Model] Refactor and decouple weight loading logic for InternVL2 model (#7067) 2024-08-02 22:36:14 -07:00
youkaichao
a0d164567c
[ci][distributed] disable ray dag tests (#7099) 2024-08-02 22:32:04 -07:00
youkaichao
04e5583425
[ci][distributed] merge distributed test commands (#7097)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-02 21:33:53 -07:00
Cyrus Leung
8c025fa703
[Frontend] Factor out chat message parsing (#7055) 2024-08-02 21:31:27 -07:00
youkaichao
69ea15e5cc
[ci][distributed] shorten wait time if server hangs (#7098) 2024-08-02 21:05:16 -07:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq (#6883)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
youkaichao
708989341e
[misc] add a flag to enable compile (#7092) 2024-08-02 16:18:45 -07:00
Rui Qiao
22e718ff1a
[Misc] Revive to use loopback address for driver IP (#7091)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 15:50:00 -07:00
Rui Qiao
05308891e2
[Core] Pipeline parallel with Ray ADAG (#6837)
Support pipeline-parallelism with Ray accelerated DAG.

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 13:55:40 -07:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
Michael Goin
b482b9a5b1
[CI/Build] Add support for Python 3.12 (#7035) 2024-08-02 13:51:22 -07:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py (#7082) 2024-08-02 10:03:24 -07:00
Jie Fu (傅杰)
c16eaac500
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend (#6931) 2024-08-02 08:55:58 -07:00
Peng Guanwen
db35186391
[Core] Comment out unused code in sampler (#7023) 2024-08-02 00:58:26 -07:00
youkaichao
660dea1235
[cuda][misc] remove error_on_invalid_device_count_status (#7069) 2024-08-02 00:14:21 -07:00
Bongwon Jang
cf2a1a4d9d
Fix tracing.py (#7065) 2024-08-01 23:28:00 -07:00
youkaichao
252357793d
[ci][distributed] try to fix pp test (#7054) 2024-08-01 22:03:12 -07:00
Cyrus Leung
3bb4b1e4cd
[mypy] Speed up mypy checking (#7056) 2024-08-01 19:49:43 -07:00
Lily Liu
954f7305a1
[Kernel] Fix input for flashinfer prefill wrapper. (#7008) 2024-08-01 18:44:16 -07:00
Woosuk Kwon
6ce01f3066
[Performance] Optimize get_seqs (#7051) 2024-08-01 18:29:52 -07:00
Tyler Michael Smith
6a11fdfbb8
[CI/Build][Bugfix] Fix CUTLASS header-only line (#7034) 2024-08-01 13:51:15 -07:00
Woosuk Kwon
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn (#7022) 2024-08-01 13:14:37 -07:00
omkar kakarparthi
562e580abc
Update run-amd-test.sh (#7044) 2024-08-01 13:12:37 -07:00
Murali Andoorveedu
fc912e0886
[Models] Support Qwen model with PP (#6974)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-08-01 12:40:43 -07:00
Michael Goin
f4fd390f5d
[Bugfix] Lower gemma's unloaded_params exception to warning (#7002) 2024-08-01 12:01:07 -07:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing (#7037) 2024-08-01 12:00:51 -07:00
Isotr0py
2dd34371a6
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm (#6992) 2024-08-01 12:00:28 -07:00
Sage Moore
7e0861bd0b
[CI/Build] Update PyTorch to 2.4.0 (#6951)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-01 11:11:24 -07:00
Alexei-V-Ivanov-AMD
a72a424b3e
[Build/CI] Fixing Docker Hub quota issue. (#7043) 2024-08-01 11:07:37 -07:00
youkaichao
c8a7e93273
[core][scheduler] simplify and improve scheduler (#6867) 2024-07-31 23:51:09 -07:00
zifeitong
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) 2024-07-31 21:13:34 -07:00
Aurick Qiao
0437492ea9
PP comm optimization: replace send with partial send + allgather (#6695)
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2024-07-31 20:15:42 -07:00
Travis Johnson
630dd9e0ae
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings (#6758)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-31 19:49:11 -07:00
Woosuk Kwon
23993a7997
[Bugfix][TPU] Do not use torch.Generator for TPUs (#6981) 2024-07-31 18:50:28 -07:00
xuyi
1d2e7fb73f
[Model] Pipeline parallel support for Qwen2 (#6924) 2024-07-31 18:49:51 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
Simon Mo
7eb0cb4a14
Revert "[Frontend] Factor out code for running uvicorn" (#7012)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-07-31 16:34:26 -07:00
Michael Goin
a0dce9383a
[Misc] Add compressed-tensors to optimized quant list (#7006) 2024-07-31 14:40:44 -07:00
Varun Sundar Rabindranath
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath
93548eb37e
[Kernel] Enable FP8 Cutlass for Ada Lovelace (#6950)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:22 -07:00
Michael Goin
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization (#6960) 2024-07-31 12:47:46 -07:00
Cody Yu
bd70013407
[MISC] Introduce pipeline parallelism partition strategies (#6920)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-31 12:02:17 -07:00
Avshalom Manevich
2ee8d3ba55
[Model] use FusedMoE layer in Jamba (#6935) 2024-07-31 12:00:24 -07:00