Commit Graph

364 Commits

Author SHA1 Message Date
Varun Sundar Rabindranath
19d02ff938
[Bugfix] Fix PP for Multi-Step (#8887) 2024-09-28 08:52:46 -07:00
Sebastian Schoennenbeck
bd429f2b75
[Core] Priority-based scheduling in async engine (#8850) 2024-09-27 15:07:10 -07:00
Varun Sundar Rabindranath
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-09-27 13:32:07 -07:00
Cyrus Leung
3b00b9c26c
[Core] rename PromptInputs and inputs (#8876) 2024-09-26 20:35:15 -07:00
Chen Zhang
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model (#8811)
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-25 13:29:32 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility (#8760)" (#8810) 2024-09-25 10:36:26 -07:00
科英
64840dfae4
[Frontend] MQLLMEngine supports profiling. (#8761) 2024-09-25 09:37:41 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility (#8760) 2024-09-25 09:36:47 -07:00
Joe Runde
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks (#8583) 2024-09-24 20:37:38 -07:00
Archit Patke
6da1ab6b41
[Core] Adding Priority Scheduling (#5958) 2024-09-24 19:50:50 -07:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Simon Mo
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" (#8750) 2024-09-24 05:45:20 +00:00
Alexander Matveev
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (#8335) 2024-09-23 15:38:04 -07:00
Alex Brooks
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs (#8657)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-23 07:44:48 +00:00
Cyrus Leung
0057894ef7
[Core] Rename PromptInputs and inputs (#8673) 2024-09-20 19:00:54 -07:00
Nick Hill
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too (#8584) 2024-09-19 12:51:06 -04:00
Joe Runde
0d47bf3bf4
[Bugfix] add dead_error property to engine client (#8574)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-18 22:10:01 +00:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) 2024-09-17 07:35:01 -07:00
Alex Brooks
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing (#8525)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-17 04:17:32 +00:00
Nick Hill
acd5511b6d
[BugFix] Fix clean shutdown issues (#8492) 2024-09-16 09:33:46 -07:00
William Lin
ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor (#8354) 2024-09-12 21:30:00 -07:00
Alexander Matveev
6821020109
[Bugfix] Fix async log stats (#8417) 2024-09-12 20:48:59 -07:00
Cyrus Leung
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class (#7329) 2024-09-13 02:56:13 +00:00
Roger Wang
c16369455f
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (#8425) 2024-09-12 14:06:51 -07:00
Nick Hill
551ce01078
[Core] Add engine option to return only deltas or final output (#7381) 2024-09-12 12:02:00 -07:00
youkaichao
f842a7aff1
[misc] remove engine_use_ray (#8126) 2024-09-11 18:23:36 -07:00
Aarni Koskela
8baa454937
[Misc] Move device options to a single place (#8322) 2024-09-11 13:25:58 -07:00
Cody Yu
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (#8342) 2024-09-10 22:28:28 +00:00
Alexander Matveev
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption (#8267) 2024-09-07 21:01:51 -07:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test (#8258) 2024-09-07 08:02:39 +00:00
William Lin
12dd715807
[misc] [doc] [frontend] LLM torch profiler support (#7943) 2024-09-06 17:48:48 -07:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format (#8168)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Harsha vardhan manoj Bikki
008cf886c9
[Neuron] Adding support for adding/ overriding neuron configuration a… (#8062)
Co-authored-by: Harsha Bikki <harbikh@amazon.com>
2024-09-04 16:33:43 -07:00
Antoni Baum
652c83b697
[Misc] Raise a more informative exception in add/remove_logger (#7750) 2024-09-03 12:28:25 -07:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step (#8050) 2024-09-03 18:50:29 +00:00
Woosuk Kwon
0fbc6696c2
[Bugfix] Fix single output condition in output processor (#7881) 2024-09-02 20:35:42 -07:00
Isotr0py
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension (#8056) 2024-09-02 08:43:26 -04:00
Robert Shaw
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059) 2024-08-31 19:44:03 +00:00
Cyrus Leung
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models (#8028) 2024-08-30 08:20:34 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step (#7652) 2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models (#7998) 2024-08-29 17:49:04 -07:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step (#7921) 2024-08-29 11:18:26 -07:00
Alexander Matveev
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911) 2024-08-28 00:02:30 -07:00
Kunshang Ji
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810) 2024-08-27 10:07:02 -07:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) 2024-08-27 12:40:02 +00:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
omrishiv
760e9f71a8
[Bugfix] neuron: enable tensor parallelism (#7562)
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
2024-08-26 15:13:13 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) 2024-08-26 05:31:10 +00:00