Commit Graph

601 Commits

Author SHA1 Message Date
Junda Chen
429284dc37
Fix dist.broadcast stall without group argument (#3408) 2024-03-14 23:25:05 -07:00
youkaichao
b522c4476f
[Misc] add HOST_IP env var (#3419)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-14 21:32:52 -07:00
Enrique Shockwave
b983ba35bd
fix marlin config repr (#3414) 2024-03-14 16:26:19 -07:00
陈序
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Dan Clark
c17ca8ef18
Add args for mTLS support (#3410)
Co-authored-by: Daniel Clark <daniel.clark@ibm.com>
2024-03-14 13:11:45 -07:00
youkaichao
8fe8386591
[Kernel] change benchmark script so that results can be directly used; tune moe kernel on A100/H100 with tp=2,4,8 (#3389) 2024-03-14 08:11:48 +00:00
Zhuohan Li
eeab52a4ff
[FIX] Simpler fix for async engine running on ray (#3371) 2024-03-13 14:18:40 -07:00
Antoni Baum
c33afd89f5
Fix lint (#3388) 2024-03-13 13:56:49 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Hui Liu
ba8dc958a3
[Minor] Fix bias in if to remove ambiguity (#3259) 2024-03-13 09:16:55 -07:00
Bo-Wen Wang
b167109ba1
[Fix] Fix quantization="gptq" when using Marlin (#3319)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-03-12 22:51:42 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
Breno Faria
49a3c8662b
Fixes #1556 double free (#3347) 2024-03-13 00:30:08 +00:00
DAIZHENWEI
654865e21d
Support Mistral Model Inference with transformers-neuronx (#3153) 2024-03-11 13:19:51 -07:00
Zhuohan Li
4c922709b6
Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Nick Hill
4b59f00e91
[Fix] Fix best_of behavior when n=1 (#3298) 2024-03-10 19:17:46 -07:00
Roy
9e8744a545
[BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) 2024-03-08 23:32:46 -08:00
Zhuohan Li
f48c6791b7
[FIX] Fix prefix test error on main (#3286) 2024-03-08 17:16:14 -08:00
Michael Goin
c2c5e0909a
Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir (#3241) 2024-03-08 13:33:10 -08:00
Woosuk Kwon
1cb0cc2975
[FIX] Make flash_attn optional (#3269) 2024-03-08 10:52:20 -08:00
whyiug
c59e120c55
Feature: add LoRA support for Qwen2 (#3177) 2024-03-07 21:58:24 -08:00
Nick Hill
d2339d6840
Connect engine healthcheck to openai server (#3260) 2024-03-07 16:38:12 -08:00
ElizaWszola
b35cc93420
Fix auto prefix bug (#3239) 2024-03-07 16:37:28 -08:00
jacobthebanana
8cbba4622c
Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) 2024-03-07 23:03:22 +00:00
Michael Goin
385da2dae2
Measure model memory usage (#3120) 2024-03-07 11:42:42 -08:00
Woosuk Kwon
2daf23ab0c
Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
TechxGenus
d3c04b6a39
Add GPTQ support for Gemma (#3200) 2024-03-07 08:19:14 +08:00
Chujie Zheng
4cb3b924cd
Add tqdm dynamic_ncols=True (#3242) 2024-03-06 22:41:42 +00:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests (#3224) 2024-03-06 01:04:23 -08:00
Nick Hill
2efce05dc3
[Fix] Avoid pickling entire LLMEngine for Ray workers (#3207)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-06 00:17:20 +00:00
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access (#3166) 2024-03-05 15:35:43 -08:00
Hongxia Yang
05af6da8d9
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#3123)
Co-authored-by: lcskrishna <lollachaitanya@gmail.com>
2024-03-04 18:14:53 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
ttbachyinsda
76e8a70476
[Minor fix] The domain dns.google may cause a socket.gaierror exception (#3176)
Co-authored-by: guofangze <guofangze@kuaishou.com>
2024-03-04 19:17:12 +00:00
Philipp Moritz
17c3103c56
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Zhuohan Li
996d095c54
[FIX] Fix styles in automatic prefix caching & add an automatic prefix caching benchmark (#3158) 2024-03-03 14:37:18 -08:00
Jason Cox
d65fac2738
Add vLLM version info to logs and openai API server (#3161) 2024-03-02 21:00:29 -08:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
cloudhan
baee28c46c
Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) 2024-03-02 14:34:48 +08:00
Allen.Dou
29e70e3e88
allow user to choose log level via --log-level instead of fixed 'info'. (#3109)
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-01 23:28:41 +00:00
Woosuk Kwon
82091b864a
Bump up to v0.3.3 (#3129) 2024-03-01 12:58:06 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
Huarong
90fbf12540
fix relative import path of protocol.py (#3134)
Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com>
2024-03-01 19:42:06 +00:00
Seonghyeon
27ca23dc00
Remove exclude_unset in streaming response (#3143) 2024-03-01 09:59:06 -08:00
Sherry
54d3544784
Fix: Output text is always truncated in some models (#3016) 2024-03-01 07:52:22 +00:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Nick Hill
29a8d6a554
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) 2024-02-29 19:20:42 +00:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture (#3089) 2024-02-29 00:51:48 -08:00
Allen.Dou
9289e577ec
add cache_config's info to prometheus metrics. (#3100) 2024-02-29 06:15:18 +00:00
Jae-Won Chung
a6d471c759
Fix: AttributeError in OpenAI-compatible server (#3018) 2024-02-28 22:04:07 -08:00
CHU Tianxiang
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) 2024-02-28 21:52:23 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Woosuk Kwon
4bd18ec0c7
[Minor] Fix type annotation in fused moe (#3045) 2024-02-26 19:44:29 -08:00
Jingru
2410e320b3
fix get_ip error in pure ipv6 environment (#2931) 2024-02-26 19:22:16 -08:00
张大成
48a8f4a7fd
Support Orion model (#2539)
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-02-26 19:17:06 -08:00
Roy
4dd6416faf
Fix stablelm (#3038) 2024-02-26 18:31:10 -08:00
Roy
c1c0d00b88
Don't use cupy when enforce_eager=True (#3037) 2024-02-26 17:33:38 -08:00
Roy
d9f726c4d0
[Minor] Remove unused config files (#3039) 2024-02-26 17:25:22 -08:00
Philipp Moritz
cfc15a1031
Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411
Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
Woosuk Kwon
f7c1234990
[Fix] Fix assertion on YaRN model len (#2984) 2024-02-23 12:57:48 -08:00
zhaoyang-star
57f044945f
Fix nvcc not found in vllm-openai image (#2781) 2024-02-22 14:25:07 -08:00
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens (#2802) 2024-02-22 14:00:12 -08:00
44670
c530e2cfe3
[FIX] Fix a bug in initializing Yarn RoPE (#2983) 2024-02-22 01:40:05 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Woosuk Kwon
95529e3253
Use Llama RMSNorm custom op for Gemma (#2974) 2024-02-21 18:28:23 -08:00
Roy
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM (#2868) 2024-02-21 18:25:05 -08:00
Mustafa Eyceoz
5574081c49
Added early stopping to completion APIs (#2939) 2024-02-21 18:24:01 -08:00
Zhuohan Li
8fbd84bf78
Bump up version to v0.3.2 (#2968)
This version focuses on broader model support, adding Gemma models (#2964) and OLMo models (#2832).
2024-02-21 11:47:25 -08:00
Nick Hill
7d2dcce175
Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Xiang Xu
5253edaacb
Add Gemma model (#2964) 2024-02-21 09:34:30 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Antoni Baum
181b27d881
Make vLLM logging formatting optional (#2877) 2024-02-20 14:38:55 -08:00
Ronen Schaffer
e433c115bc
Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Simon Mo
86fd8bb0ac
Add warning to prevent changes to benchmark api server (#2858) 2024-02-18 21:36:19 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li
537c9755a7
[Minor] Small fix to make distributed init logic in worker look cleaner (#2905) 2024-02-18 14:39:00 -08:00
Mark Mozolewski
786b7f18a5
Add code-revision config argument for Hugging Face Hub (#2892) 2024-02-17 22:36:53 -08:00
jvmncs
8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.

No work has been done here to scope client permissions to specific models.
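
For illustration, a minimal sketch of checking that `/models` listing from Python (assuming the server above is running locally on the default port 8000; the `requests` dependency and the printed names are illustrative, not part of this PR):
```python
# Hypothetical client-side check of the /v1/models listing (assumes the
# OpenAI-compatible server started above is reachable on localhost:8000).
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
model_ids = [m["id"] for m in resp.json()["data"]]
# Expected: the base model plus one entry per LoRA module, e.g.
# ['meta-llama/Llama-2-7b-hf', 'sql-lora', 'sql-lora2']
print(model_ids)
```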
2024-02-17 12:00:48 -08:00
Nick Hill
185b2c29e2
Defensively copy sampling_params (#2881)
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.

Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
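
As a rough illustration of the pattern (a minimal sketch, not vLLM's actual add_request implementation; the Engine class and its fields here are hypothetical):
```python
# Defensive-copy sketch: snapshot the caller's SamplingParams on entry so
# mutations made after add_request() returns cannot affect async sampling.
import copy
from dataclasses import dataclass, field

@dataclass
class SamplingParams:          # stand-in for vllm.SamplingParams
    temperature: float = 1.0
    stop: list = field(default_factory=list)

class Engine:                  # hypothetical engine holding pending requests
    def __init__(self):
        self.pending = {}

    def add_request(self, request_id: str, prompt: str,
                    sampling_params: SamplingParams) -> None:
        # deepcopy isolates the engine from later caller-side mutation
        self.pending[request_id] = (prompt, copy.deepcopy(sampling_params))
```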
2024-02-17 11:18:04 -08:00
Woosuk Kwon
5f08050d8d
Bump up to v0.3.1 (#2887) 2024-02-16 15:05:18 -08:00
shiyi.c_98
64da65b322
Prefix Caching - fix t4 triton error (#2517) 2024-02-16 14:17:55 -08:00
Philipp Moritz
4f2ad11135
Fix DeciLM (#2883) 2024-02-14 22:29:57 -08:00
Woosuk Kwon
d7afab6d3a
[BugFix] Fix GC bug for LLM class (#2882) 2024-02-14 22:17:44 -08:00
Philipp Moritz
31348dff03
Align LoRA code between Mistral and Mixtral (fixes #2875) (#2880)
* Fix AttributeError: MixtralModel object has no attribute org_vocab_size.

* Make LoRA logic for Mistral and Mixtral the same

---------

Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
2024-02-15 01:00:43 +01:00
Woosuk Kwon
25e86b6a61
Don't use cupy NCCL for AMD backends (#2855) 2024-02-14 12:30:44 -08:00
Roy
4efbac6d35
Migrate AquilaForCausalLM to LlamaForCausalLM (#2867) 2024-02-14 12:30:24 -08:00
Woosuk Kwon
7e45107f51
[Fix] Fix memory profiling when GPU is used by multiple processes (#2863) 2024-02-13 19:52:34 -08:00
Philipp Moritz
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 (#2861) 2024-02-13 18:01:15 -08:00
Philipp Moritz
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM (#2860)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 17:12:05 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral (#2831)
* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell
2024-02-14 00:55:45 +01:00
Philipp Moritz
317b29de0f
Remove Yi model definition, please use LlamaForCausalLM instead (#2854)
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
Woosuk Kwon
a463c333dd
Use CuPy for CUDA graphs (#2811) 2024-02-13 11:32:06 -08:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models (#2637)" (#2851)
This reverts commit 5c976a7e1a.
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models (#2637) 2024-02-13 00:09:23 -08:00
Rex
563836496a
Refactor 2 awq gemm kernels into m16nXk32 (#2723)
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
2024-02-12 11:02:17 -08:00
Hongxia Yang
0580aab02f
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768) 2024-02-10 23:14:37 -08:00
Woosuk Kwon
3711811b1d
Disable custom all reduce by default (#2808) 2024-02-08 09:58:03 -08:00
SangBin Cho
65b89d16ee
[Ray] Integration compiled DAG off by default (#2471) 2024-02-08 09:57:25 -08:00
Lily Liu
fe6d09ae61
[Minor] More fix of test_cache.py CI test failure (#2750) 2024-02-06 11:38:38 -08:00
liuyhwangyh
ed70c70ea3
modelscope: fix issue when model parameter is not a model id but a path to the model. (#2489) 2024-02-06 09:57:15 -08:00
Woosuk Kwon
f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
Lukas
b92adec8e8
Set local logging level via env variable (#2774) 2024-02-05 14:26:50 -08:00
Rex
5a6c81b051
Remove eos tokens from output by default (#2611) 2024-02-04 14:32:42 -08:00
dancingpipi
51cd22ce56
set & get LLM internal tokenizer instead of the TokenizerGroup (#2741)
Co-authored-by: shujunhua1 <shujunhua1@jd.com>
2024-02-04 14:25:36 -08:00
zspo
0e163fce18
Fix default length_penalty to 1.0 (#2667) 2024-02-01 15:59:39 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Pernekhan Utemuratov
c410f5d020
Use revision when downloading the quantization config file (#2697)
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
2024-02-01 15:41:58 -08:00
Simon Mo
b9e96b17de
fix python 3.8 syntax (#2716) 2024-02-01 14:00:58 -08:00
Fengzhe Zhou
cd9e60c76c
Add Internlm2 (#2666) 2024-02-01 09:27:40 -08:00
Robert Shaw
93b38bea5d
Refactor Prometheus and Add Request Level Metrics (#2316) 2024-01-31 14:58:07 -08:00
Philipp Moritz
d0d93b92b1
Add unit test for Mixtral MoE layer (#2677) 2024-01-31 14:34:17 -08:00
zspo
c664b0e683
fix some bugs (#2689) 2024-01-31 10:09:23 -08:00
Tao He
d69ff0cbbb
Fixes assertion failure in prefix caching: the lora index mapping should respect prefix_len (#2688)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-01-31 18:00:13 +01:00
Zhuohan Li
1af090b57d
Bump up version to v0.3.0 (#2656) 2024-01-31 00:07:07 -08:00
Woosuk Kwon
3dad944485
Add quantized mixtral support (#2673) 2024-01-30 16:34:10 -08:00
Woosuk Kwon
105a40f53a
[Minor] Fix false warning when TP=1 (#2674) 2024-01-30 14:39:40 -08:00
Philipp Moritz
bbe9bd9684
[Minor] Fix a small typo (#2672) 2024-01-30 13:40:37 -08:00
Wen Sun
d79ced3292
Fix 'Actor methods cannot be called directly' when using --engine-use-ray (#2664)
* fix: engine-use-ray complaint

* fix: typo
2024-01-30 17:17:05 +01:00
Philipp Moritz
ab40644669
Fused MOE for Mixtral (#2542)
Co-authored-by: chen shen <scv119@gmail.com>
2024-01-29 22:43:37 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star
b72af8f1ed
Fix error when tp > 1 (#2644)
Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>
2024-01-28 22:47:39 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Murali Andoorveedu
89be30fa7d
Small async_llm_engine refactor (#2618) 2024-01-27 23:28:37 -08:00
Woosuk Kwon
5f036d2bcc
[Minor] Fix warning on Ray dependencies (#2630) 2024-01-27 15:43:40 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Xiang Xu
220a47627b
Use head_dim in config if exists (#2622) 2024-01-27 10:30:49 -08:00
Casper
beb89f68b4
AWQ: Up to 2.66x higher throughput (#2566) 2024-01-26 23:53:17 -08:00
Philipp Moritz
390b495ff3
Don't build punica kernels by default (#2605) 2024-01-26 15:19:19 -08:00
dakotamahan-stability
3a0e1fc070
Support for Stable LM 2 (#2598)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-26 12:45:19 -08:00
Hongxia Yang
6b7de1a030
[ROCm] add support to ROCm 6.0 and MI300 (#2274) 2024-01-26 12:41:10 -08:00
Junyang Lin
2832e7b9f9
fix names and license for Qwen2 (#2589) 2024-01-24 22:37:51 -08:00
Simon Mo
3a7dd7e367
Support Batch Completion in Server (#2529) 2024-01-24 17:11:07 -08:00
Federico Galatolo
f1f6cc10c7
Added include_stop_str_in_output and length_penalty parameters to OpenAI API (#2562) 2024-01-24 10:21:56 -08:00
Nikola Borisov
3209b49033
[Bugfix] fix crash if max_tokens=None (#2570) 2024-01-23 22:38:55 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Erfan Al-Hossami
9c1352eb57
[Feature] Simple API token authentication and pluggable middlewares (#1106) 2024-01-23 15:13:00 -08:00
Junyang Lin
94b5edeb53
Add qwen2 (#2495) 2024-01-22 14:34:21 -08:00
Philipp Moritz
ab7e6006d6
Fix https://github.com/vllm-project/vllm/issues/2540 (#2545) 2024-01-22 19:02:38 +01:00
Cade Daniel
18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model (#2424) 2024-01-21 16:31:47 -08:00
Jannis Schönleber
71d63ed72e
migrate pydantic from v1 to v2 (#2531) 2024-01-21 16:05:56 -08:00
Nick Hill
d75c40734a
[Fix] Keep scheduler.running as deque (#2523) 2024-01-20 22:36:09 -08:00