| Antoni Baum | 999ef0b917 | [Misc] Add numpy implementation of compute_slot_mapping (#7377) | 2024-08-09 22:52:29 +00:00 |
| Dipika Sikka | 5c6c54d67a | [Bugfix] Fix PerTensorScaleParameter weight loading for fused models (#7376) | 2024-08-09 21:23:46 +00:00 |
| Mahesh Keralapura | 933790c209 | [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) | 2024-08-09 13:55:13 -07:00 |
| Roger Wang | 70d268a399 | [Bugfix] Fix ITL recording in serving benchmark (#7372) | 2024-08-09 10:00:00 -07:00 |
| Pooya Davoodi | 249b88228d | [Frontend] Support embeddings in the run_batch API (#7132); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-08-09 09:48:21 -07:00 |
| Alexander Matveev | 74af2bbd90 | [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360) | 2024-08-09 16:35:49 +00:00 |
| Alexander Matveev | fc7b8d1eef | [Performance] e2e overheads reduction: Small followup diff (#7364) | 2024-08-09 15:49:36 +00:00 |
| Isotr0py | 67abdbb42f | [VLM][Doc] Add stop_token_ids to InternVL example (#7354) | 2024-08-09 14:51:04 +00:00 |
| Mor Zusman | 07ab160741 | [Model][Jamba] Mamba cache single buffer (#6739); Co-authored-by: Mor Zusman <morz@ai21.com> | 2024-08-09 10:07:06 -04:00 |
| Nick Hill | b4e9528f95 | [Core] Streamline stream termination in AsyncLLMEngine (#7336) | 2024-08-09 07:06:36 +00:00 |
| William Lin | 57b7be0e1c | [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) | 2024-08-09 05:42:45 +00:00 |
| Travis Johnson | 99b4cf5f23 | [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218); Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> | 2024-08-08 22:08:46 -07:00 |
| Alexander Matveev | e02ac55617 | [Performance] Optimize e2e overheads: Reduce python allocations (#7162) | 2024-08-08 21:34:28 -07:00 |
| Woosuk Kwon | 73388c07a4 | [TPU] Fix dockerfile.tpu (#7331) | 2024-08-08 20:24:58 -07:00 |
| Cyrus Leung | 7eb4a51c5f | [Core] Support serving encoder/decoder models (#7258) | 2024-08-09 10:39:41 +08:00 |
| Siyuan Liu | 0fa14907da | [TPU] Add Load-time W8A16 quantization for TPU Backend (#7005) | 2024-08-08 18:35:49 -07:00 |
| Simon Mo | 5923532e15 | Add Skywork AI as Sponsor (#7314) | 2024-08-08 13:59:57 -07:00 |
| Jee Jee Li | a049b107e2 | [Misc] Temporarily resolve the error of BitAndBytes (#7308) | 2024-08-08 13:42:58 -07:00 |
| Isotr0py | 8334c39f37 | [Bugfix] Fix new Llama3.1 GGUF model loading (#7269) | 2024-08-08 13:42:44 -07:00 |
| Daniele | e904576743 | [CI/Build] Dockerfile.cpu improvements (#7298) | 2024-08-08 15:24:52 -04:00 |
| Michael Goin | e14fb22e59 | [Doc] Put collect_env issue output in a <detail> block (#7310) | 2024-08-08 11:22:49 -07:00 |
| Zach Zheng | 782e53ab59 | [Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) | 2024-08-08 10:43:30 -07:00 |
| Joe Runde | 21b9c49aa3 | [Frontend] Kill the server on engine death (#6594); Signed-off-by: Joe Runde <joe@joerun.de>; Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> | 2024-08-08 09:47:48 -07:00 |
| Luka Govedič | 5fb4a3f678 | [Bugfix][Kernel] Increased atol to fix failing tests (#7305) | 2024-08-08 12:16:13 -04:00 |
| Jee Jee Li | 757ac70a64 | [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 (#7273) | 2024-08-08 14:02:41 +00:00 |
| Murali Andoorveedu | 6dffa4b0a6 | [Bugfix] Fix LoRA with PP (#7292) | 2024-08-08 00:02:27 -07:00 |
| Cherilyn Buren | 48abee9e54 | [Frontend] remove max_num_batched_tokens limit for lora (#7288) | 2024-08-08 06:17:29 +00:00 |
| Rui Qiao | 746709642c | [Misc] Fix typos in scheduler.py (#7285); Signed-off-by: Rui Qiao <ruisearch42@gmail.com> | 2024-08-07 17:06:01 -07:00 |
| Lily Liu | e53dfd3eaf | [Kernel] Fix Flashinfer Correctness (#7284) | 2024-08-07 16:26:52 -07:00 |
| Michael Goin | 6d94420246 | [Doc] Update supported_hardware.rst (#7276) | 2024-08-07 14:21:50 -07:00 |
| Nick Hill | fc1493a01e | [FrontEnd] Make merge_async_iterators is_cancelled arg optional (#7282) | 2024-08-07 13:35:14 -07:00 |
| Lucas Wilkinson | 311f743831 | [Bugfix] Fix gptq failure on T4s (#7264) | 2024-08-07 20:05:37 +00:00 |
| Kevin H. Luu | 469b3bc538 | [ci] Make building wheels per commit optional (#7278); Signed-off-by: kevin <kevin@anyscale.com> | 2024-08-07 11:34:25 -07:00 |
| Michael Goin | 5223199e03 | [Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) | 2024-08-07 11:23:12 -07:00 |
| Maximilien de Bayser | fde47d3bc2 | [BugFix] Fix frontend multiprocessing hang (#7217); Signed-off-by: Max de Bayser <mbayser@br.ibm.com>; Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> | 2024-08-07 18:09:36 +00:00 |
| Stas Bekman | 0e12cd67a8 | [Doc] add online speculative decoding example (#7243) | 2024-08-07 09:58:02 -07:00 |
| Ilya Lavrenov | 80cbe10c59 | [OpenVINO] migrate to latest dependencies versions (#7251) | 2024-08-07 09:49:10 -07:00 |
| Isotr0py | b764547616 | [Bugfix] Fix input processor for InternVL2 model (#7164); Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> | 2024-08-07 09:32:07 -07:00 |
| Rafael Vasquez | ab0f5e2823 | Fixes typo in function name (#7275); Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> | 2024-08-07 09:29:27 -07:00 |
| Robert Shaw | 564985729a | [ BugFix ] Move zmq frontend to IPC instead of TCP (#7222) | 2024-08-07 16:24:56 +00:00 |
| Dipika Sikka | 0f7052bc7e | [Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874) | 2024-08-07 09:17:58 -07:00 |
| youkaichao | 639159b2a6 | [distributed][misc] add specialized method for cuda platform (#7249) | 2024-08-07 08:54:52 -07:00 |
| Cyrus Leung | 66d617e343 | [Frontend] Gracefully handle missing chat template and fix CI failure (#7238); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-08-07 09:12:05 +00:00 |
| Atilla Akkuş | 7b261092de | [BUGFIX]: top_k is expected to be an integer. (#7227) | 2024-08-07 00:32:16 -07:00 |
| Roger Wang | 2385c8f374 | [Doc] Mock new dependencies for documentation (#7245) | 2024-08-07 06:43:03 +00:00 |
| Nick Hill | 9a3f49ae07 | [BugFix] Overhaul async request cancellation (#7111) | 2024-08-07 13:21:41 +08:00 |
| Michael Goin | f9a5600649 | [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225) | 2024-08-06 18:34:26 -07:00 |
| afeldman-nm | fd95e026e0 | [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942); Co-authored-by: Andrew Feldman <afeld2012@gmail.com>; Co-authored-by: Nick Hill <nickhill@us.ibm.com> | 2024-08-06 16:51:47 -04:00 |
| xiaobochen123 | 660470e5a3 | [Core] Optimize evictor-v2 performance (#7193) | 2024-08-06 12:34:25 -07:00 |
| Luka Govedič | 8d59dbb000 | [Kernel] Add per-tensor and per-token AZP epilogues (#5941); Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> | 2024-08-06 18:17:08 +00:00 |