General matrix multiplication for f32 and f64 matrices. Operates on matrices with general layout (they can use arbitrary row and column stride).
Please read the API documentation here
We presently provide a few good microkernels, portable and for x86-64, and only one operation: the general matrix-matrix multiplication (gemm).
This crate was inspired by the macro/microkernel approach to matrix multiplication that is used by the BLIS project.
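For a feel of the API, here is a minimal sketch of a call through the f32 entry point, sgemm, which computes C ← alpha·A·B + beta·C over raw pointers plus row and column strides (the 2×2 data and strides below are purely illustrative; see the API documentation for the authoritative signature):

```rust
use matrixmultiply::sgemm;

fn main() {
    // Two 2x2 row-major matrices: row stride 2, column stride 1 (in elements).
    let a = [1.0_f32, 2.0, 3.0, 4.0];
    let b = [5.0_f32, 6.0, 7.0, 8.0];
    let mut c = [0.0_f32; 4];

    // C <- 1.0 * A B + 0.0 * C
    unsafe {
        sgemm(
            2, 2, 2,           // m, k, n
            1.0,               // alpha
            a.as_ptr(), 2, 1,  // A: row stride, column stride
            b.as_ptr(), 2, 1,  // B likewise
            0.0,               // beta
            c.as_mut_ptr(), 2, 1,
        );
    }
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
}
```

Because the strides are explicit, the same entry point handles row major, column major, and more general strided layouts without any transposition step.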
cargo bench is useful for special cases and small matrices. The benchmark program examples/benchmark.rs supports custom sizes, some configuration, and csv output. Use the script benches/benchloop.py to run benchmarks over parameter ranges.

0.3.2
New crate feature cgemm for the complex matmult functions cgemm and zgemm (a usage sketch follows this entry).
New crate feature constconf for compile-time configuration of matrix kernel parameters for chunking. Improved scripts for benchmarking over ranges of different settings. With thanks to @DutchGhost for the const-time parsing functions.
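A minimal sketch of what a cgemm call could look like, assuming the documented signature (the sgemm parameter list preceded by two CGemmOption conjugation flags, with complex scalars packed as [re, im] arrays); the 1×1 data is purely illustrative:

```rust
// Cargo.toml: matrixmultiply = { version = "0.3", features = ["cgemm"] }
use matrixmultiply::{cgemm, CGemmOption};

fn main() {
    // 1x1 complex matrices, each element an [re, im] pair:
    // (1 + 2i) * (3 + 4i) = -5 + 10i
    let a = [[1.0_f32, 2.0]];
    let b = [[3.0_f32, 4.0]];
    let mut c = [[0.0_f32, 0.0]];

    unsafe {
        cgemm(
            CGemmOption::Standard, CGemmOption::Standard, // no conjugation
            1, 1, 1,        // m, k, n
            [1.0, 0.0],     // alpha = 1 + 0i
            a.as_ptr(), 1, 1,
            b.as_ptr(), 1, 1,
            [0.0, 0.0],     // beta = 0
            c.as_mut_ptr(), 1, 1,
        );
    }
    assert_eq!(c[0], [-5.0, 10.0]);
}
```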
0.3.1
Fixed a type that was &T when it should have been &[T].

0.3.0
Implement initial support for threading using a bespoke thread pool with little contention.
To use, enable the feature threading (and configure the number of threads with the environment variable MATMUL_NUM_THREADS).
Initial support is for up to 4 threads; this will be updated with more experience in coming versions. A usage sketch follows this entry.
Added a better benchmarking program for arbitrary size and layout, see examples/benchmark.rs for this; it supports csv output for better recording of measurements.
Minimum supported rust version is 1.41.1 and the version update policy has been updated.
Updated to Rust 2018 edition
Moved CI to github actions (so long travis and thanks for all the fish).
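As a sketch of how the threaded build could be exercised (the size and data here are arbitrary choices for illustration; the call itself is unchanged by the feature):

```rust
// Build with:   cargo run --release --features matrixmultiply/threading
// Run with eg:  MATMUL_NUM_THREADS=4 ./target/release/<binary>
use matrixmultiply::dgemm;

fn main() {
    let n = 512; // arbitrary illustration size
    let a = vec![1.0_f64; n * n];
    let b = vec![1.0_f64; n * n];
    let mut c = vec![0.0_f64; n * n];

    // C <- 1.0 * A B + 0.0 * C, all row major (row stride n, column stride 1).
    // With the threading feature enabled, this single call is split over
    // the internal thread pool; the API does not change.
    unsafe {
        dgemm(
            n, n, n,
            1.0,
            a.as_ptr(), n as isize, 1,
            b.as_ptr(), n as isize, 1,
            0.0,
            c.as_mut_ptr(), n as isize, 1,
        );
    }
    // Each entry is a dot product of n ones with n ones.
    assert_eq!(c[0], n as f64);
}
```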
0.2.4
0.2.3
Fixes related to -Ctarget-cpu=native use (not recommended; use automatic runtime feature detection instead).

0.2.2
New dgemm avx and fma kernels implemented by R. Janis Goldschmidt (@SuperFluffy), with fast cases for both row and column major output.
Benchmark improvements: Using fma instructions reduces execution time on dgemm benchmarks by 25-35% compared with the avx kernel, see issue #35
Using the avx dgemm kernel reduces execution time on dgemm benchmarks by 5-7% compared with the previous version's autovectorized kernel.
New fma adaptation of the sgemm avx kernel by R. Janis Goldschmidt (@SuperFluffy).
Benchmark improvement: Using fma instructions reduces execution time on sgemm benchmarks by 10-15% compared with the avx kernel, see issue #35
More flexible kernel selection allows kernels to individually set all their parameters, ensures the fallback (plain Rust) kernels can be tuned for performance as well, and moves feature detection out of the gemm loop.
Benchmark improvement: Reduces execution time on various benchmarks by 1-2% in the avx kernels, see #37.
Improved testing to cover input/output strides of more diversity.
0.2.1
Improve matrix packing by taking better advantage of contiguous inputs.
Benchmark improvement: execution time for the 64×64 problem where inputs are either both row major or both column major changed by -5% for sgemm and -1% for dgemm. (#26)
The sgemm avx kernel now handles column major output arrays just as it does row major arrays.
Benchmark improvement: execution time for the 32×32 problem where output is column major changed by -11%. (#27)
0.2.0
Use runtime feature detection on x86 and x86-64 platforms, to enable AVX-specific microkernels at runtime if available on the currently executing configuration.
This means no special compiler flags are needed to enable native instruction performance!
Implement a specialized 8×8 sgemm (f32) AVX microkernel; this speeds up matrix multiplication by another 25%.
Use std::alloc for allocation of aligned packing buffers.
We now require Rust 1.28 as the minimal version.
0.1.15
0.1.14
0.1.13
Now uses the crate rawpointer, a crate with raw pointer methods taken from this project.

0.1.12
0.1.11
0.1.10
0.1.9
0.1.8
0.1.7
0.1.6
0.1.5
0.1.4
0.1.3
0.1.2
0.1.1