squall/vllm - vllm

Author	SHA1	Message	Date
Woosuk Kwon	200a2ffa6b	[Misc] Refactor Llama3 RoPE initialization (#7637 )	2024-08-18 17:18:12 -07:00
Alex Brooks	40e1360bb6	[CI/Build] Add text-only test for Qwen models (#7475 ) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>	2024-08-19 07:43:46 +08:00
Robert Shaw	e3b318216d	[ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend (#7279 ) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-18 20:19:48 +00:00
Woosuk Kwon	ab7165f2c7	[TPU] Optimize RoPE forward_native2 (#7636 )	2024-08-18 01:15:10 -07:00
Woosuk Kwon	0c2fa50b84	[TPU] Use mark_dynamic only for dummy run (#7634 )	2024-08-18 00:18:53 -07:00
Woosuk Kwon	ce143353c6	[TPU] Skip creating empty tensor (#7630 )	2024-08-17 14:22:46 -07:00
Roger Wang	bbf55c4805	[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530 )	2024-08-17 13:30:55 -07:00
Jee Jee Li	1ef13cf92f	[Misc]Fix BitAndBytes exception messages (#7626 )	2024-08-17 12:02:14 -07:00
youkaichao	832163b875	[ci][test] allow longer wait time for api server (#7629 )	2024-08-17 11:26:38 -07:00
Besher Alkurdi	e73f76eec6	[Model] Pipeline parallel support for JAIS (#7603 )	2024-08-17 11:11:09 -07:00
youkaichao	d95cc0a55c	[core][misc] update libcudart finding (#7620 ) Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com>	2024-08-16 23:01:35 -07:00
youkaichao	5bf45db7df	[ci][test] fix engine/logger test (#7621 )	2024-08-16 23:00:59 -07:00
youkaichao	eed020f673	[misc] use nvml to get consistent device name (#7582 )	2024-08-16 21:15:13 -07:00
Xander Johnson	7c0b7ea214	[Bugfix] add >= 1.0 constraint for openai dependency (#7612 )	2024-08-16 20:56:01 -07:00
SangBin Cho	4706eb628e	[aDAG] Unflake aDAG + PP tests (#7600 )	2024-08-16 20:49:30 -07:00
Rui Qiao	bae888cb8e	[Bugfix] Clear engine reference in AsyncEngineRPCServer (#7618 ) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>	2024-08-16 20:44:05 -07:00
Alexei-V-Ivanov-AMD	6bd19551b0	.[Build/CI] Enabling passing AMD tests. (#7610 )	2024-08-16 20:25:32 -07:00
bnellnm	e680349994	[Bugfix] Fix custom_ar support check (#7617 )	2024-08-16 19:05:49 -07:00
Michael Goin	44f26a9466	[Model] Align nemotron config with final HF state and fix lm-eval-small (#7611 )	2024-08-16 15:56:34 -07:00
bnellnm	37fd47e780	[Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596 )	2024-08-16 14:00:11 -07:00
bnellnm	7759ae958f	[Kernel][Misc] dynamo support for ScalarType (#7594 )	2024-08-16 13:59:49 -07:00
bnellnm	9f69856356	[Kernel] register punica functions as torch ops (#7591 )	2024-08-16 13:59:38 -07:00
Michael Goin	d4f0f17b02	[Doc] Update quantization supported hardware table (#7595 )	2024-08-16 13:59:27 -07:00
Michael Goin	b3f4e17935	[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints (#7444 )	2024-08-16 13:59:16 -07:00
Mahesh Keralapura	93478b63d2	[Core] Fix tracking of model forward time in case of PP>1 (#7440 ) [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)	2024-08-16 13:46:01 -07:00
William Lin	f366f6339b	[spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571 ) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-16 11:41:56 -07:00
Michael Goin	855866caa9	[Kernel] Add tuned triton configs for ExpertsInt8 (#7601 )	2024-08-16 11:37:01 -07:00
Mor Zusman	7fc23be81c	[Kernel] W8A16 Int8 inside FusedMoE (#7415 )	2024-08-16 10:06:51 -07:00
Charlie Fu	e837b624f2	[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210 )	2024-08-16 10:06:30 -07:00
fzyzcjy	ec724a725e	support tqdm in notebooks (#7510 )	2024-08-16 09:17:50 -07:00
Gordon Wong	0e39a33c6d	[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method (#7513 )	2024-08-16 10:05:18 -06:00
Kuntai Du	6fc5b0f249	[CI] Fix crashes of performance benchmark (#7500 )	2024-08-16 08:08:45 -07:00
Nick Hill	9587b050fb	[Core] Use uvloop with zmq-decoupled front-end (#7570 )	2024-08-15 22:48:07 -07:00
youkaichao	54bd9a03c4	register custom op for flash attn and use from torch.ops (#7536 )	2024-08-15 22:38:56 -07:00
jon-chuang	50b8d08dbd	[Misc/Testing] Use `torch.testing.assert_close` (#7324 )	2024-08-16 04:24:04 +00:00
Michael Goin	e165528778	[CI] Move quantization cpu offload tests out of fastcheck (#7574 )	2024-08-15 21:16:20 -07:00
nunjunj	3b19e39dc5	Chat method for offline llm (#5049 ) Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal> Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal> Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-08-15 19:41:34 -07:00
youkaichao	4cd7d47fed	[ci/test] rearrange tests and make adag test soft fail (#7572 )	2024-08-15 19:39:04 -07:00
Grant Pinkert	f878c8feb0	[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453 )	2024-08-16 02:38:08 +00:00
shangmingc	b67ae00cdb	[Misc] Add quantization config support for speculative model. (#7343 )	2024-08-15 19:34:28 -07:00
Michael Goin	9c8e2d1161	[Bugfix][Harmless] Fix float16 dtype for model_is_embedding (#7566 )	2024-08-15 18:26:19 -07:00
Michael Goin	21313e09e3	[Bugfix] Fix default weight loading for scalars (#7534 )	2024-08-15 13:10:22 -07:00
PHILO-HE	f4da5f7b6d	[Misc] Update dockerfile for CPU to cover protobuf installation (#7182 )	2024-08-15 10:03:01 -07:00
omrishiv	9c1f78d5d6	[Bugfix] update neuron for version > 0.5.0 (#7175 ) Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-15 09:44:14 -07:00
Woosuk Kwon	fc93e56143	[Bugfix][TPU] Correct env variable for XLA cache path (#7544 )	2024-08-15 00:02:29 -07:00
Kameshwara Pavan Kumar Mantha	22b39e11f2	llama_index serving integration documentation (#6973 ) Co-authored-by: pavanmantha <pavan.mantha@thevaslabs.io>	2024-08-14 15:38:37 -07:00
Kyle Sayers	f55a9aea45	[Misc] Revert `compressed-tensors` code reuse (#7521 )	2024-08-14 15:07:37 -07:00
Woosuk Kwon	951fdd66d3	[TPU] Set per-rank XLA cache (#7533 )	2024-08-14 14:47:51 -07:00
William Lin	2ecf7b1757	[core] [3/N] multi-step args and sequence.py (#7452 )	2024-08-14 12:32:45 -07:00
Cyrus Leung	3f674a49b5	[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126 )	2024-08-14 17:55:42 +00:00

2417 Commits