squall/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
omrishiv	3c3012398e	[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron (#6844) Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>	2024-07-26 20:20:16 -07:00
Woosuk Kwon	ced36cd89b	[ROCm] Upgrade PyTorch nightly version (#6845)	2024-07-26 20:16:13 -07:00
Sanger Steel	969d032265	[Bugfix]: Fix Tensorizer test failures (#6835)	2024-07-26 20:02:25 -07:00
Lucas Wilkinson	55712941e5	[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b (#6852)	2024-07-27 02:27:44 +00:00
Cyrus Leung	981b0d5673	[Frontend] Factor out code for running uvicorn (#6828)	2024-07-27 09:58:25 +08:00
Woosuk Kwon	d09b94ca58	[TPU] Support collective communications in XLA devices (#6813)	2024-07-27 01:45:57 +00:00
chenqianfzh	bb5494676f	enforce eager mode with bnb quantization temporarily (#6846)	2024-07-27 01:32:20 +00:00
Gurpreet Singh Dhami	b5f49ee55b	Update README.md (#6847)	2024-07-27 00:26:45 +00:00
Zhanghao Wu	150a1ffbfd	[Doc] Update SkyPilot doc for wrong indents and instructions for update service (#4283)	2024-07-26 14:39:10 -07:00
Michael Goin	281977bd6e	[Doc] Add Nemotron to supported model docs (#6843)	2024-07-26 17:32:44 -04:00
Li, Jiang	3bbb4936dc	[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation (#6125)	2024-07-26 13:50:10 -07:00
Woosuk Kwon	aa4867791e	[Misc][TPU] Support TPU in initialize_ray_cluster (#6812)	2024-07-26 19:39:49 +00:00
Woosuk Kwon	71734f1bf2	[Build/CI][ROCm] Minor simplification to Dockerfile.rocm (#6811)	2024-07-26 12:28:32 -07:00
Tyler Michael Smith	50704f52c4	[Bugfix][Kernel] Promote another index to int64_t (#6838)	2024-07-26 18:41:04 +00:00
Michael Goin	07278c37dd	[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)	2024-07-26 14:33:42 -04:00
youkaichao	85ad7e2d01	[doc][debugging] add known issues for hangs (#6816)	2024-07-25 21:48:05 -07:00
Peng Guanwen	89a84b0bb7	[Core] Use array to speedup padding (#6779)	2024-07-25 21:31:31 -07:00
Anthony Platanios	084a01fd35	[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. (#6770)	2024-07-25 21:25:35 -07:00
QQSong	062a1d0fab	Fix ReplicatedLinear weight loading (#6793)	2024-07-25 19:24:58 -07:00
Kevin H. Luu	2eb9f4ff26	[ci] Mark tensorizer as soft fail and separate from grouped test (#6810) [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check (#6810) Signed-off-by: kevin <kevin@anyscale.com>	2024-07-25 18:08:33 -07:00
youkaichao	443c7cf4cf	[ci][distributed] fix flaky tests (#6806)	2024-07-25 17:44:09 -07:00
SangBin Cho	1adddb14bf	[Core] Fix ray forward_dag error mssg (#6792)	2024-07-25 16:53:25 -07:00
Woosuk Kwon	b7215de2c5	[Docs] Publish 5th meetup slides (#6799)	2024-07-25 16:47:55 -07:00
youkaichao	f3ff63c3f4	[doc][distributed] improve multinode serving doc (#6804)	2024-07-25 15:38:32 -07:00
Lucas Wilkinson	cd7edc4e87	[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors (#6798)	2024-07-25 15:05:09 -07:00
Kuntai Du	6a1e25b151	[Doc] Add documentations for nightly benchmarks (#6412)	2024-07-25 11:57:16 -07:00
Tyler Michael Smith	95db75de64	[Bugfix] Add synchronize to prevent possible data race (#6788) Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-07-25 10:40:01 -07:00
Michael Goin	65b1f121c8	[Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints (#6761)	2024-07-25 09:46:15 -07:00
Robert Shaw	889da130e7	[ Misc ] `fp8-marlin` channelwise via `compressed-tensors` (#6524) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-07-25 09:46:04 -07:00
Alphi	b75e314fff	[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V (#6787) Co-authored-by: hezhihui <hzh7269@modelbest.cn> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-07-25 09:42:49 -07:00
Chang Su	316a41ac1d	[Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755)	2024-07-24 22:48:07 -07:00
Alexander Matveev	0310029a2f	[Bugfix] Fix awq_marlin and gptq_marlin flags (#6745)	2024-07-24 22:34:11 -07:00
Cody Yu	309aaef825	[Bugfix] Fix decode tokens w. CUDA graph (#6757)	2024-07-24 22:33:56 -07:00
Alphi	9e169a4c61	[Model] Adding support for MiniCPM-V (#4087)	2024-07-24 20:59:30 -07:00
Evan Z. Liu	5689e256ba	[Frontend] Represent tokens with identifiable strings (#6626)	2024-07-25 09:51:00 +08:00
youkaichao	740374d456	[core][distributed] fix zmq hang (#6759)	2024-07-24 17:37:12 -07:00
Hongxia Yang	d88c458f44	[Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users (#6754)	2024-07-24 14:32:57 -07:00
Michael Goin	421e218b37	[Bugfix] Bump transformers to 4.43.2 (#6752)	2024-07-24 13:22:16 -07:00
Antoni Baum	5448f67635	[Core] Tweaks to model runner/input builder developer APIs (#6712)	2024-07-24 12:17:12 -07:00
Antoni Baum	0e63494cf3	Add fp8 support to `reshape_and_cache_flash` (#6667)	2024-07-24 18:36:52 +00:00
Daniele	ee812580f7	[Frontend] split run_server into build_server and run_server (#6740)	2024-07-24 10:36:04 -07:00
Allen.Dou	40468b13fa	[Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. (#6686)	2024-07-24 08:58:42 -07:00
Nick Hill	2cf0df3381	[Bugfix] Fix speculative decode seeded test (#6743)	2024-07-24 08:58:31 -07:00
LF Marques	545146349c	Adding f-string to validation error which is missing (#6748)	2024-07-24 08:55:53 -07:00
liuyhwangyh	f4f8a9d892	[Bugfix]fix modelscope compatible issue (#6730)	2024-07-24 05:04:46 -07:00
Alexei-V-Ivanov-AMD	b570811706	[Build/CI] Update run-amd-test.sh. Enable Docker Hub login. (#6711)	2024-07-24 05:01:14 -07:00
Woosuk Kwon	ccc4a73257	[Docs][ROCm] Detailed instructions to build from source (#6680)	2024-07-24 01:07:23 -07:00
Roger Wang	0a740a11ba	[Bugfix] Fix token padding for chameleon (#6724)	2024-07-24 01:05:09 -07:00
Nick Hill	c882a7f5b3	[SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6714)	2024-07-24 07:34:22 +00:00
William Lin	5e8ca973eb	[Bugfix] fix flashinfer cudagraph capture for PP (#6708)	2024-07-24 01:49:44 +00:00

2091 Commits