Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Gregory Shtrasberg
9984605412
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility ( #7477 )
...
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com>
2024-08-21 16:47:36 -07:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
bnellnm
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo ( #7596 )
2024-08-16 14:00:11 -07:00
bnellnm
7759ae958f
[Kernel][Misc] dynamo support for ScalarType ( #7594 )
2024-08-16 13:59:49 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
Lucas Wilkinson
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels ( #7323 )
2024-08-12 14:40:13 -04:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Tyler Michael Smith
6e4852ce28
[CI/Build] Suppress divide-by-zero and missing return statement warnings ( #7001 )
2024-08-05 16:00:01 -04:00
Tyler Michael Smith
8571ac4672
[Kernel] Update CUTLASS to 3.5.1 ( #7085 )
2024-08-05 15:13:43 -04:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton ( #5036 )
2024-07-31 17:12:24 -07:00
Varun Sundar Rabindranath
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) ( #6996 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath
93548eb37e
[Kernel] Enable FP8 Cutlass for Ada Lovelace ( #6950 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:22 -07:00
HandH1998
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
Tyler Michael Smith
cbbc904470
[Kernel] Squash a few more warnings ( #6914 )
2024-07-30 13:50:42 -04:00
Varun Sundar Rabindranath
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace ( #6848 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
Tyler Michael Smith
61a97c32f6
[Kernel] Fix marlin divide-by-zero warnings ( #6904 )
2024-07-30 01:26:07 +00:00
Tyler Michael Smith
aae6d36f7e
[Kernel] Remove unused variables in awq/gemm_kernels.cu ( #6908 )
2024-07-29 18:01:17 -06:00
Tyler Michael Smith
60d1c6e584
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel ( #6901 )
2024-07-29 09:59:02 -07:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace ( #6677 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
Joe
14dbd5a767
[Model] H2O Danube3-4b ( #6451 )
2024-07-26 20:47:50 -07:00
Lucas Wilkinson
55712941e5
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b ( #6852 )
2024-07-27 02:27:44 +00:00
Li, Jiang
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation ( #6125 )
2024-07-26 13:50:10 -07:00
Tyler Michael Smith
50704f52c4
[Bugfix][Kernel] Promote another index to int64_t ( #6838 )
2024-07-26 18:41:04 +00:00
Antoni Baum
0e63494cf3
Add fp8 support to reshape_and_cache_flash ( #6667 )
2024-07-24 18:36:52 +00:00
Tyler Michael Smith
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
Varun Sundar Rabindranath
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
Tyler Michael Smith
9dad5cc859
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace ( #6384 )
2024-07-14 13:37:19 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
Joe Runde
ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b ( #5930 )
...
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
Tyler Michael Smith
5d2a1a9cf0
Unmark more files as executable ( #5962 )
2024-06-28 17:34:56 -04:00
Tyler Michael Smith
6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues ( #5931 )
2024-06-28 17:10:34 +00:00
Chip Kerchner
38a1674abb
Support CPU inference with VSX PowerPC ISA ( #5652 )
2024-06-26 21:53:04 +00:00
Luka Govedič
5bfd1bbc98
[Kernel] Adding bias epilogue support for cutlass_scaled_mm ( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Varun Sundar Rabindranath
6c916ac8a8
[BugFix] [Kernel] Add Cutlass2x fallback kernels ( #5744 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-23 21:07:11 +00:00
Roger Wang
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
Jinzhen Lin
1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA ( #5441 )
2024-06-20 17:55:41 -07:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Varun Sundar Rabindranath
a7dcc62086
[Kernel] Update Cutlass int8 kernel configs for SM80 ( #5275 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 13:33:21 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
Varun Sundar Rabindranath
111af1fa2c
[Kernel] Update Cutlass int8 kernel configs for SM90 ( #5514 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 06:37:08 +00:00
Tyler Michael Smith
b23ce92032
[Bugfix] Fix CUDA version check for mma warning suppression ( #5642 )
2024-06-18 23:48:49 +00:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
Joe Runde
5002175e80
[Kernel] Add punica dimensions for Granite 13b ( #5559 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Tyler Michael Smith
348616ac4b
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later ( #5401 )
2024-06-14 10:02:00 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
Jie Fu (傅杰)
cd9c0d65d9
[Hardware][Intel] Support CPU inference with AVX2 ISA ( #5452 )
2024-06-13 17:22:24 -06:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to make better use of memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup, with the larger gains when the hidden size is large.
In detail, we applied three optimizations:
- Use the inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
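The first two ideas in the list can be illustrated with a small scalar sketch. This is not the CUDA kernel from the PR: `quantize_scaled` and its `width` parameter are hypothetical names, rounding to the FP8 grid is left out, and plain Python gets no speedup from chunking. It only shows the arithmetic shape: one division to build the inverted scale, then multiplications over chunks of 4 with a scalar tail.

```python
def quantize_scaled(values, scale, width=4):
    """Scale `values` by 1/scale, processing `width` elements per iteration.

    Hypothetical helper for illustration; rounding to FP8 is omitted.
    """
    inv_scale = 1.0 / scale  # one division up front, multiplications afterwards
    out = []
    full = len(values) - len(values) % width
    # Main loop: one chunk of `width` elements per iteration,
    # mirroring a 4-wide vectorized load in a real kernel.
    for i in range(0, full, width):
        chunk = values[i:i + width]
        out.extend(v * inv_scale for v in chunk)
    # Tail loop: leftover elements that do not fill a whole chunk.
    for v in values[full:]:
        out.append(v * inv_scale)
    return out


print(quantize_scaled([1.0, 2.0, 3.0, 4.0, 5.0], scale=0.5))
```

In a real kernel the chunked loop would map to vector loads and stores, and the tail loop would cover hidden sizes that are not a multiple of the vector width.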
2024-06-12 14:07:26 -07:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
Jie Fu (傅杰)
6840a71610
[Misc] Remove unused cuda_utils.h in CPU backend ( #5345 )
2024-06-07 14:09:13 -07:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Tyler Michael Smith
ccd4f129e8
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size ( #5157 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-05 10:44:15 -07:00
Yuan
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
Tyler Michael Smith
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
Divakar Verma
a66cf40b20
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer ( #4927 )
...
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP.
2024-06-02 14:13:26 -07:00
Varun Sundar Rabindranath
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Tyler Michael Smith
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
Tyler Michael Smith
1197e02141
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels ( #5168 )
2024-05-31 17:21:38 -07:00
Simon Mo
e9d3aa04f6
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" ( #5149 )
2024-05-30 22:00:26 -07:00
SnowDist
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Alexander Matveev
6d21fa1cad
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) ( #5136 )
2024-05-30 21:02:11 -05:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Alexander Matveev
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
raywanb
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
Tyler Michael Smith
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
Michael Goin
5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with clang-format ( #4722 )
2024-05-22 07:18:41 +00:00
Alexander Matveev
da5a0b539d
Remove marlin warning ( #4918 )
2024-05-20 14:55:34 +00:00
Tyler Michael Smith
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
Silencio
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
Alexander Matveev
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Jinzhen Lin
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
Steve Grubb
dac6a3f6ed
[Misc] Apply a couple g++ cleanups ( #4719 )
2024-05-10 13:37:05 +00:00
Cody Yu
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
kliuae
ff5abcd746
[ROCm] Add support for Punica kernels on AMD GPUs ( #3140 )
...
Co-authored-by: miloice <jeffaw99@hotmail.com>
2024-05-09 09:19:50 -07:00
alexm-nm
e288df0632
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin ( #4626 )
2024-05-08 17:14:31 -07:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
youkaichao
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
Philipp Moritz
a98187cf72
[Kernel] Make static FP8 scaling more robust ( #4570 )
...
Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint
https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale
(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k ), I'm getting the following mostly random performance on MMLU:
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.2295|± |0.0035|
| - humanities |N/A |none | 5|acc |0.2421|± |0.0062|
| - other |N/A |none | 5|acc |0.2398|± |0.0076|
| - social_sciences|N/A |none | 5|acc |0.2171|± |0.0074|
| - stem |N/A |none | 5|acc |0.2125|± |0.0073|
With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7008|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6453|± |0.0065|
| - other |N/A |none | 5|acc |0.7692|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8083|± |0.0070|
| - stem |N/A |none | 5|acc |0.6115|± |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
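The clamping step described above can be sketched in a few lines of Python. This is an illustration only, not the kernel code: `scale_and_clamp` is a hypothetical name, and the constant 448.0 is the largest finite value representable in `float8_e4m3fn`, which is what `std::numeric_limits<c10::Float8_e4m3fn>::max()` returns.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def scale_and_clamp(x, scale):
    """Divide by a static scale, then clamp into the representable FP8 range.

    Hypothetical helper; the final conversion to FP8 is omitted.
    """
    scaled = x / scale
    # Without the clamp, an activation larger than scale * 448 would fall
    # outside the FP8 range and could turn into NaN on conversion.
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, scaled))


print(scale_and_clamp(10.0, 2.0))     # in range, unchanged by the clamp
print(scale_and_clamp(1000.0, 1.0))   # saturates at the positive limit
print(scale_and_clamp(-1000.0, 1.0))  # saturates at the negative limit
```

Saturating costs some precision on outliers when a calibrated scale underestimates an activation, but it keeps the output finite.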
2024-05-06 17:39:28 -07:00
Lily Liu
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Austin Veselka
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Philipp Moritz
12628d3c78
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales ( #4343 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-27 04:49:59 +00:00
alexm-nm
aae08249ac
[Bugfix] Fix marlin kernel crash on H100 ( #4218 )
...
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.
2024-04-24 10:35:01 -07:00
Woosuk Kwon
468d761b32
[Misc] Reduce supported Punica dtypes ( #4304 )
2024-04-23 18:54:33 -07:00
Philipp Moritz
eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral ( #4244 )
...
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118 ), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
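For readers unfamiliar with the term, dynamic per-tensor scaling can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `dynamic_per_tensor_scale` is a hypothetical name, the tensor is a plain Python list, and the small floor that avoids a zero scale is an assumption of this sketch. The scale is derived from the tensor's own maximum absolute value at run time, which is why no calibration dataset is needed.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def dynamic_per_tensor_scale(values):
    """Return one scale for the whole tensor, computed from its current contents.

    Hypothetical helper for illustration only.
    """
    amax = max((abs(v) for v in values), default=0.0)
    # Assumed guard: keep the scale non-zero for an all-zero or empty tensor.
    return max(amax, 1e-12) / FP8_E4M3_MAX


activations = [-896.0, 1.0, 224.0]
scale = dynamic_per_tensor_scale(activations)
scaled = [v / scale for v in activations]
print(scale, scaled)
```

After dividing by this scale, the largest-magnitude element lands exactly on the edge of the FP8 range, so no value overflows.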
**Performance**: For this PR, the focus is on keeping the code clean while still getting reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954 ) will be submitted in a follow-up PR. With this PR, the results are as follows:
<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">
**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7018|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6472|± |0.0065|
| - other |N/A |none | 5|acc |0.7673|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8099|± |0.0070|
| - stem |N/A |none | 5|acc |0.6131|± |0.0083|
```
This compares favorably with the FP16 results, which are:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7020|± |0.1313|
| - humanities |N/A |none | 5|acc |0.6425|± |0.1349|
| - other |N/A |none | 5|acc |0.7744|± |0.1038|
| - social_sciences|N/A |none | 5|acc |0.8131|± |0.0695|
| - stem |N/A |none | 5|acc |0.6108|± |0.1383|
```
Happy hacking!
2024-04-24 01:18:23 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
Jee Li
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B ( #4053 )
2024-04-13 07:55:05 -07:00