Commit Graph

  • c055747867
    [model][utils] add extract_layer_index utility function (#10599) main youkaichao 2024-11-23 22:22:54 -0800
  • eda2b3589c
    Revert "Print running script to enhance CI log readability" (#10601) youkaichao 2024-11-23 21:31:47 -0800
  • 1c445dca51
    [CI/Build] Print running script to enhance CI log readability (#10594) Jee Jee Li 2024-11-24 11:57:13 +0800
  • 1700c543a5
    [Bugfix] Fix LoRA weight sharding (#10450) Jee Jee Li 2024-11-24 09:23:17 +0800
  • 17d8fc1806
    [bugfix] Fix example/tensorize_vllm_model tests (#10595) Jee Jee Li 2024-11-24 09:22:33 +0800
  • 04668ebe7a
    [Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593) Isotr0py 2024-11-24 02:12:20 +0800
  • 651f6c31ac
    For ppc64le, disabled tests for now and addressed space issues (#10538) Nishidha 2024-11-23 15:03:53 +0530
  • 86a44fb896
    [Platforms] Refactor openvino code (#10573) JiHuazhong 2024-11-23 14:23:12 +0800
  • 4cfe5d2bca
    [Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel (#10541) Isotr0py 2024-11-23 13:25:46 +0800
  • c8acd80548
    [2/N] handling placeholders in merged multi-modal processor (#10485) Cyrus Leung 2024-11-23 13:25:09 +0800
  • 4634a89d18
    Prefix Cache Aware Scheduling [1/n] (#10128) Ricky Xu 2024-11-22 21:15:55 -0800
  • 7c25fe45a6
    [AMD] Add support for GGUF quantization on ROCm (#10254) kliuae 2024-11-23 13:14:49 +0800
  • 02a43f82a9
    Update default max_num_batch_tokens for chunked prefill to 2048 (#10544) Michael Goin 2024-11-23 00:14:19 -0500
  • cfea9c04ef
    [Model] Fix Baichuan BNB online quantization (#10572) Chen Wu 2024-11-23 13:13:59 +0800
  • 7d8ffb344f
    [Bugfix] Internal Server Error when tool_choice is incorrect. (#10567) Varun Vinayak Shenoy 2024-11-22 21:13:29 -0800
  • 4aba6e3d1a
    [core] gemma2 full context length support (#10584) youkaichao 2024-11-22 20:13:54 -0800
  • 978b39744b
    [Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432) Tyler Michael Smith 2024-11-22 22:14:03 -0500
  • ebda51968b
    [Core] Fix broken log configuration (#10458) Russell Bryant 2024-11-22 21:23:51 -0500
  • 9195dbdbca
    [Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164) Travis Johnson 2024-11-22 19:17:38 -0700
  • d559979c54
    [bugfix] fix cpu tests (#10585) youkaichao 2024-11-22 17:34:03 -0800
  • d345f409b7
    [V1] EngineCore supports profiling (#10564) Zhonghua Deng 2024-11-23 09:16:15 +0800
  • 28598f3939
    [Core] remove temporary local variables in LLMEngine.__init__ (#10577) Russell Bryant 2024-11-22 19:22:53 -0500
  • 948c859571
    support bitsandbytes quantization with qwen model (#10549) zixuanzhang226 2024-11-22 16:16:14 -0800
  • 97814fbf0f
    [v1] Refactor KVCacheManager for more hash input than token ids (#10507) Ricky Xu 2024-11-22 15:27:25 -0800
  • eebad39f26
    [torch.compile] support all attention backends (#10558) youkaichao 2024-11-22 14:04:42 -0800
  • db100c5cde
    [bugfix] fix full graph tests (#10581) youkaichao 2024-11-22 10:02:14 -0800
  • 11fcf0e066
    Remove token-adding chat embedding params (#10551) Noam Gat 2024-11-22 09:59:47 +0200
  • b6374e09b0
    [Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948) Isotr0py 2024-11-22 15:01:56 +0800
  • a111d0151f
    [platforms] absorb worker cls difference into platforms folder (#10555) youkaichao 2024-11-21 21:00:32 -0800
  • 446c7806b2
    [Minor] Fix line-too-long (#10563) Woosuk Kwon 2024-11-21 19:40:40 -0800
  • 33e0a2540a
    [9/N] torch.compile LLM usage (#10552) youkaichao 2024-11-21 19:13:31 -0800
  • aed074860a
    [Benchmark] Add new H100 machine (#10547) Simon Mo 2024-11-21 18:27:20 -0800
  • 9afa014552
    Add small example to metrics.rst (#10550) Michael Goin 2024-11-21 18:43:43 -0500
  • 46fe9b46d8
    [Minor] Revert change in offline inference example (#10545) Woosuk Kwon 2024-11-21 13:28:16 -0800
  • cf656f5a02
    [misc] improve error message (#10553) youkaichao 2024-11-21 13:13:17 -0800
  • edec3385b6
    [CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535) Yunmeng 2024-11-22 05:03:58 +0800
  • f9310cbd0c
    [V1] Fix Compilation config & Enable CUDA graph by default (#10528) Woosuk Kwon 2024-11-21 12:53:39 -0800
  • 7560ae5caf
    [8/N] enable cli flag without a space (#10529) youkaichao 2024-11-21 12:30:42 -0800
  • e7a8341c7c
    [Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536) Cyrus Leung 2024-11-22 02:09:43 +0800
  • c51e397fe8
    [Misc] Suppress duplicated logging regarding multimodal input pipeline (#10530) Roger Wang 2024-11-21 09:21:31 -0800
  • 2385b60d83
    [Kernel] Register punica ops directly (#10522) Jee Jee Li 2024-11-22 01:18:11 +0800
  • da7e702c6f
    [Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored (#10180) Chauncey 2024-11-22 00:24:32 +0800
  • 4d676f0852
    [Bugfix] Embedding model pooling_type equals ALL and multi input's bug (#10494) Xiaoyu Zhang 2024-11-21 22:40:02 +0800
  • d5ec121f95
    [Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models (#10518) Isotr0py 2024-11-21 22:20:08 +0800
  • 8a93a598d9
    fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (#10524) Wang, Yi 2024-11-21 19:15:36 +0800
  • 1cfde82ffd
    [Model] Add Support for Multimodal Granite Models (#10291) Alex Brooks 2024-11-21 03:46:20 -0700
  • f0e0238016
    [Doc] fix a small typo in docstring of llama_tool_parser (#10513) Zhong Qishuai 2024-11-21 17:05:23 +0800
  • aaddce5d26
    [platforms] improve error message for unspecified platforms (#10520) youkaichao 2024-11-20 23:07:56 -0800
  • 3430857b64
    [Misc] Increase default video fetch timeout (#10495) Cyrus Leung 2024-11-21 15:06:42 +0800
  • 8b0fe06c89
    [torch.compile] Inductor code caching fix (#10273) Luka Govedič 2024-11-21 00:44:57 -0500
  • 9d827170a3
    [Platforms] Add device_type in Platform (#10508) Mengqing Cao 2024-11-21 12:44:20 +0800
  • 6c1208d083
    [Core] Add Sliding Window Support with Flashinfer (#10462) Pavani Majety 2024-11-20 19:56:47 -0800
  • 388ee3de66
    [torch.compile] limit inductor threads and lazy import quant (#10482) youkaichao 2024-11-20 18:36:33 -0800
  • 2f77b6cfec
    [TPU] Implement prefix caching for TPUs (#10307) Woosuk Kwon 2024-11-20 13:54:15 -0800
  • c68f7ede6a
    [Bugfix]: allow extra fields in requests to openai compatible server (#10463) Guillaume Calmettes 2024-11-20 22:42:21 +0100
  • 0cd3d9717e
    [7/N] torch.compile, reduce compilation time (#10460) youkaichao 2024-11-20 11:20:38 -0800
  • 5f1d6af2b6
    [perf bench] H200 development (#9768) Simon Mo 2024-11-20 11:06:56 -0800
  • 772a66732d
    [platforms] restore xpu check for parallel config (#10479) youkaichao 2024-11-20 09:13:28 -0800
  • 63f1fde277
    [Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355) Li, Jiang 2024-11-20 18:57:39 +0800
  • d5b28447e0
    [Platforms] Refactor xpu code (#10468) Mengqing Cao 2024-11-20 14:52:13 +0800
  • 09dbf9ff16
    [Bugfix] Handle conflicts between modern and legacy fields (#10471) Cyrus Leung 2024-11-20 14:45:08 +0800
  • 343041c4c4
    [model] Reduce medusa weight (#10454) Sky Lee 2024-11-20 14:05:55 +0800
  • ed701ca963
    [ci/build] Combine nightly and optional (#10465) Kevin H. Luu 2024-11-19 19:36:03 -1000
  • 7629a9c6e5
    [CI/Build] Support compilation with local cutlass path (#10423) (#10424) wchen61 2024-11-20 13:35:50 +0800
  • 709c9f1f25
    [CI/Build] Add sphinx/rst linter for docs (#10366) Rafael Vasquez 2024-11-20 00:35:31 -0500
  • b4be5a8adb
    [Bugfix] Enforce no chunked prefill for embedding models (#10470) Cyrus Leung 2024-11-20 13:12:51 +0800
  • ad44437ba3
    [Bugfix] Fix Mamba model initialization and MLP Speculator weights loading (#10456) Isotr0py 2024-11-20 13:04:05 +0800
  • 9e05252b46
    [Misc] Add __setitem__ for LazyDict (#10469) Yanyi Liu 2024-11-20 12:44:57 +0800
  • d200972e7f
    [Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464) Lucas Wilkinson 2024-11-19 22:40:33 -0500
  • d5b68aba2f
    [CI/Build] Update Dockerfile.rocm (#10434) Alexei-V-Ivanov-AMD 2024-11-19 19:19:59 -0600
  • a324d3a1a7
    Change granite chat template to keep json list formatting for tool calls (#10452) Maximilien de Bayser 2024-11-19 22:16:54 -0300
  • b00b33d77e
    [Model][Quantization] HQQ support through Marlin kernel expansion (#9766) ElizaWszola 2024-11-19 22:31:12 +0100
  • efa9084628
    [Core] Avoid metrics log noise when idle (#8868) Russell Bryant 2024-11-19 16:05:25 -0500
  • 803f37eaaa
    [6/N] torch.compile rollout to users (#10437) youkaichao 2024-11-19 10:09:03 -0800
  • fd9f124971
    [Doc] fix link for page that was renamed (#10455) Russell Bryant 2024-11-19 12:48:30 -0500
  • 1ea291a417
    Fix: Build error seen on Power Architecture (#10421) Manjul Mohan 2024-11-19 23:04:57 +0530
  • 11fd7ea639
    [Pixtral-Large] Pixtral actually has no bias in vision-lang adapter (#10449) Patrick von Platen 2024-11-19 18:33:06 +0100
  • f028dff33d
    [BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) (#10398) COSMOPlat 2024-11-19 21:42:50 +0800
  • b4614656b8
    [CI][CPU] adding numa node number as container name suffix (#10441) Yuan 2024-11-19 21:16:43 +0800
  • 25f9c78961
    [misc][plugin] improve plugin loading (#10443) youkaichao 2024-11-19 02:43:21 -0800
  • 5390d6664f
    [Doc] Add the start of an arch overview page (#10368) Russell Bryant 2024-11-19 04:52:11 -0500
  • 382b6a4852
    [Misc] Avoid misleading warning messages (#10438) Jee Jee Li 2024-11-19 16:54:58 +0800
  • 272e31c0bd
    [Bugfix] Guard for negative counter metrics to prevent crash (#10430) Travis Johnson 2024-11-18 21:57:10 -0700
  • 74f8c2cf5f
    Add openai.beta.chat.completions.parse example to structured_outputs.rst (#10433) Michael Goin 2024-11-18 23:37:46 -0500
  • 8c1fb50705
    [Platform][Refactor] Extract func get_default_attn_backend to Platform (#10358) Mengqing Cao 2024-11-19 11:22:26 +0800
  • 7eb719df13
[Bugfix] Fix Phi-3 BNB online quantization (#10417) Jee Jee Li 2024-11-19 11:21:42 +0800
  • 284203f171
    [ci/build] Have dependabot ignore all patch update (#10436) Kevin H. Luu 2024-11-18 15:04:25 -1000
  • 90a6c759ca
    [misc] partial prefix & random input generation benchmark (#9929) Ricky Xu 2024-11-18 15:39:14 -0800
  • 2298e69b5f
    [ci][bugfix] fix kernel tests (#10431) youkaichao 2024-11-18 15:29:37 -0800
  • a03ea40792
    [3/N][torch.compile] consolidate custom op logging (#10399) youkaichao 2024-11-18 15:14:59 -0800
  • 96d999fbe8
    [Kernel] Initial Machete W4A8 support + Refactors (#9855) Lucas Wilkinson 2024-11-18 14:59:29 -0500
  • c2170a5b39
    [Kernel] Explicitly specify other value in tl.load calls (#9014) Angus Wang 2024-11-18 11:39:40 -0800
  • 6b2d25efc7
    [Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107) Yan Ma 2024-11-19 02:18:05 +0800
  • 281cc4b3cd
    [Model][Bugfix] Support TP for PixtralHF ViT (#10405) Michael Goin 2024-11-18 13:04:14 -0500
  • 4f686d139f
    Fix open_collective value in FUNDING.yml (#10426) Andrew Nesbitt 2024-11-18 17:52:42 +0000
  • 31894a2155
    [Doc] Add documentation for Structured Outputs (#9943) ismael-dm 2024-11-18 18:52:12 +0100
  • 7851b45196
    [5/N][torch.compile] torch.jit.script --> torch.compile (#10406) youkaichao 2024-11-18 07:20:06 -0800
  • 4186be8111
    [Doc] Update doc for LoRA support in GLM-4V (#10425) B-201 2024-11-18 23:08:30 +0800
  • e7ebb662d7
    [Model] Remove transformers attention porting in VITs (#10414) Isotr0py 2024-11-18 21:45:21 +0800
  • 5be4e52b65
[Model][LoRA] LoRA support added for glm-4v (#10418) B-201 2024-11-18 20:57:10 +0800
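The top commit in this graph adds an `extract_layer_index` utility (#10599). As an illustration only — this is a hypothetical sketch, not the code merged in that PR — such a helper typically parses the single integer layer index out of a dotted module prefix like `model.layers.3.self_attn`:

```python
def extract_layer_index(prefix: str) -> int:
    """Return the single integer layer index embedded in a dotted
    module prefix such as "model.layers.3.self_attn".

    Hypothetical sketch: the real utility added in #10599 may differ.
    """
    # Collect every dot-separated token that is a plain integer.
    indices = [int(tok) for tok in prefix.split(".") if tok.isdigit()]
    if len(indices) != 1:
        raise ValueError(
            f"expected exactly one layer index in {prefix!r}, "
            f"found {len(indices)}")
    return indices[0]
```

Raising on zero or multiple numeric tokens keeps the helper safe to call on arbitrary prefixes: an ambiguous name fails loudly instead of silently picking the wrong layer.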