Chen Zhang
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger ( #9530 )
2024-10-20 00:05:02 +00:00
Yue Zhang
c5eea3c8ba
[Frontend] Support simpler image input format ( #9478 )
2024-10-18 23:17:07 -07:00
Joe Runde
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
sasha0552
337ed76671
[Bugfix] Fix offline mode when using mistral_common ( #9457 )
2024-10-18 18:12:32 -07:00
Cody Yu
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py ( #9510 )
2024-10-18 14:30:55 -07:00
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
tomeras91
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser ( #9154 )
2024-10-18 10:27:48 +00:00
Joe Runde
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Robert Shaw
343f8e0905
Support BERTModel (first encoder-only embedding model) ( #9056 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: laishzh <laishengzhang@gmail.com>
Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-10-17 23:21:01 +00:00
bnellnm
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
Luka Govedič
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
Mor Zusman
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba ( #9189 )
2024-10-16 12:12:43 -04:00
Cyrus Leung
cee711fdbb
[Core] Rename input data types ( #8688 )
2024-10-16 10:49:37 +00:00
Cyrus Leung
7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM ( #9303 )
2024-10-16 14:31:00 +08:00
Cyrus Leung
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
Chang Su
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids ( #9034 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-15 15:40:43 -07:00
Michael Goin
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions ( #8909 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-15 15:40:25 -07:00
Nick Hill
e9d517f276
[BugFix] Fix chat API continuous usage stats ( #9357 )
2024-10-14 23:19:48 -07:00
Xiang Xu
f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images ( #9095 )
2024-10-14 15:24:26 -07:00
Lily Liu
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
sixgod
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
Tyler Michael Smith
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
Jee Jee Li
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
youkaichao
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-10-10 21:30:44 -07:00
youkaichao
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
Lucas Wilkinson
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
Michael Goin
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
Li, Jiang
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
youkaichao
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
chenqianfzh
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
bnellnm
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
Daniele
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
Alex Brooks
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:31:26 +00:00
Alex Brooks
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-08 14:12:56 +00:00
Brendan Wong
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
youkaichao
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
Isotr0py
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
Isotr0py
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
Cyrus Leung
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
youkaichao
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
Varun Sundar Rabindranath
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-06 12:48:11 -07:00
Cyrus Leung
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-10-06 16:35:27 +08:00
Brendan Wong
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-10-05 23:39:03 -07:00
Andy Dai
5df1834895
[Bugfix] Fix order of arguments matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
Chen Zhang
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
Xin Yang
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
ElizaWszola
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
2024-10-04 12:34:44 -06:00
Flávia Béo
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
2024-10-04 18:31:40 +00:00