Varun Sundar Rabindranath
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
Sebastian Schoennenbeck
bd429f2b75
[Core] Priority-based scheduling in async engine ( #8850 )
2024-09-27 15:07:10 -07:00
Varun Sundar Rabindranath
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-09-27 13:32:07 -07:00
Cyrus Leung
3b00b9c26c
[Core] rename PromptInputs and inputs ( #8876 )
2024-09-26 20:35:15 -07:00
Chen Zhang
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-25 13:29:32 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility ( #8760 ) ( #8810 )
2024-09-25 10:36:26 -07:00
科英
64840dfae4
[Frontend] MQLLMEngine supports profiling. ( #8761 )
2024-09-25 09:37:41 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility ( #8760 )
2024-09-25 09:36:47 -07:00
Joe Runde
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks ( #8583 )
2024-09-24 20:37:38 -07:00
Archit Patke
6da1ab6b41
[Core] Adding Priority Scheduling ( #5958 )
2024-09-24 19:50:50 -07:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Simon Mo
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" ( #8750 )
2024-09-24 05:45:20 +00:00
Alexander Matveev
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse ( #8335 )
2024-09-23 15:38:04 -07:00
Alex Brooks
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs ( #8657 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-23 07:44:48 +00:00
Cyrus Leung
0057894ef7
[Core] Rename PromptInputs to PromptType, and inputs to prompt ( #8673 )
2024-09-20 19:00:54 -07:00
Nick Hill
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too ( #8584 )
2024-09-19 12:51:06 -04:00
Joe Runde
0d47bf3bf4
[Bugfix] add dead_error property to engine client ( #8574 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-18 22:10:01 +00:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
Alex Brooks
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing ( #8525 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-17 04:17:32 +00:00
Nick Hill
acd5511b6d
[BugFix] Fix clean shutdown issues ( #8492 )
2024-09-16 09:33:46 -07:00
William Lin
ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor ( #8354 )
2024-09-12 21:30:00 -07:00
Alexander Matveev
6821020109
[Bugfix] Fix async log stats ( #8417 )
2024-09-12 20:48:59 -07:00
Cyrus Leung
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class ( #7329 )
2024-09-13 02:56:13 +00:00
Roger Wang
c16369455f
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models ( #8425 )
2024-09-12 14:06:51 -07:00
Nick Hill
551ce01078
[Core] Add engine option to return only deltas or final output ( #7381 )
2024-09-12 12:02:00 -07:00
youkaichao
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
Aarni Koskela
8baa454937
[Misc] Move device options to a single place ( #8322 )
2024-09-11 13:25:58 -07:00
Cody Yu
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled ( #8342 )
2024-09-10 22:28:28 +00:00
Alexander Matveev
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption ( #8267 )
2024-09-07 21:01:51 -07:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
William Lin
12dd715807
[misc] [doc] [frontend] LLM torch profiler support ( #7943 )
2024-09-06 17:48:48 -07:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Harsha vardhan manoj Bikki
008cf886c9
[Neuron] Adding support for adding/overriding neuron configuration a… ( #8062 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com>
2024-09-04 16:33:43 -07:00
Antoni Baum
652c83b697
[Misc] Raise a more informative exception in add/remove_logger ( #7750 )
2024-09-03 12:28:25 -07:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
Woosuk Kwon
0fbc6696c2
[Bugfix] Fix single output condition in output processor ( #7881 )
2024-09-02 20:35:42 -07:00
Isotr0py
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension ( #8056 )
2024-09-02 08:43:26 -04:00
Robert Shaw
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation ( #8059 )
2024-08-31 19:44:03 +00:00
Cyrus Leung
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models ( #8028 )
2024-08-30 08:20:34 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
Alexander Matveev
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) ( #7911 )
2024-08-28 00:02:30 -07:00
Kunshang Ji
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. ( #7810 )
2024-08-27 10:07:02 -07:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
omrishiv
760e9f71a8
[Bugfix] neuron: enable tensor parallelism ( #7562 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
2024-08-26 15:13:13 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer ( #7836 )
2024-08-26 05:31:10 +00:00