rgiulietti/la-blis
A BLIS-like implementation of GEneral-Matrix-Matrix multiplication (GEMM)

How to build, run the tests and install

There are currently two platform-specific µ-kernels. Depending on whether you are on Apple silicon or on Intel, adapt the import clauses in com.oracle.la.s.GEMM.java and com.oracle.la.d.GEMM.java accordingly.

Then run `mvn clean install` to build, run the tests, and install the artifact into your local Maven repository (usually `${HOME}/.m2/repository`).

Overview of the BLIS-like implementation

The implementation is based on the discussion in the paper Anatomy of High-Performance Many-Threaded Matrix Multiplication. It performs an update C += A B, where A, B, C are given matrices. Contrary to the title of the paper, however, this implementation is single-threaded.

In essence, the matrices are subdivided into smaller submatrices to perform a block matrix multiplication. The subdivision aims to keep the smaller submatrices in the faster layers of the memory hierarchy.
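To illustrate the idea, here is a minimal, hedged sketch of block matrix multiplication in plain Java. The class name, the single `BLOCK` parameter, and the row-major layout are assumptions for illustration only; the actual implementation uses several distinct blocking parameters and packed buffers.

```java
// Hypothetical sketch of block matrix multiplication C += A * B, where the
// m×k matrix A, the k×n matrix B and the m×n matrix C are stored row-major
// in flat arrays. BLOCK is an illustrative stand-in for the real cache
// blocking parameters, which are chosen per cache level in practice.
public final class BlockedGemm {
    static final int BLOCK = 4; // illustrative block size, not the library's value

    public static void gemm(double[] a, double[] b, double[] c, int m, int k, int n) {
        for (int i0 = 0; i0 < m; i0 += BLOCK)
            for (int p0 = 0; p0 < k; p0 += BLOCK)
                for (int j0 = 0; j0 < n; j0 += BLOCK)
                    // Multiply one pair of submatrices; in a cache-aware
                    // implementation each such block fits in a fast cache level.
                    for (int i = i0; i < Math.min(i0 + BLOCK, m); i++)
                        for (int p = p0; p < Math.min(p0 + BLOCK, k); p++)
                            for (int j = j0; j < Math.min(j0 + BLOCK, n); j++)
                                c[i * n + j] += a[i * k + p] * b[p * n + j];
    }
}
```

The result is the same as a plain triple loop; only the traversal order changes, so each submatrix is reused while it is still cache-resident.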

To this end, submatrices of the inputs are copied into buffers while reshuffling their contents, so that the µ-kernel enjoys contiguous, 1-strided access.
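The reshuffling step can be sketched as follows. The class and method names, the `NR` panel width, and the exact panel layout are assumptions, not the library's identifiers; the point is only that after packing, the µ-kernel reads `NR` consecutive values per step of the k dimension, i.e. with stride 1.

```java
// Illustrative packing of a kc×nc block of a row-major matrix B into NR-wide
// panels. For panel jp, the NR values of row p end up contiguous, so the
// µ-kernel walks the buffer with stride 1. Assumes nc is a multiple of NR.
public final class PackB {
    static final int NR = 4; // illustrative µ-kernel width

    public static double[] pack(double[] b, int ldb, int pc, int jc, int kc, int nc) {
        double[] buf = new double[kc * nc];
        int q = 0;
        for (int jp = 0; jp < nc; jp += NR)      // one panel per NR columns
            for (int p = 0; p < kc; p++)         // walk down the k dimension
                for (int j = 0; j < NR; j++)     // NR contiguous values per row
                    buf[q++] = b[(pc + p) * ldb + jc + jp + j];
        return buf;
    }
}
```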

There are 5 nested loops to achieve the desired subdivision, copying and reshuffling. They don't directly perform any matrix computation, but the innermost 5th loop repeatedly invokes a platform-specific µ-kernel that does the real work. The µ-kernel makes use of the Vector API to exploit data parallelism for the computations.

Each combined iteration of the outermost 1st and 2nd loops defines the submatrix of B that is copied into a buffer and reshuffled for contiguous access in the µ-kernel. Similarly, each iteration of the 3rd loop defines the submatrix of A that is copied and reshuffled for optimal access in the µ-kernel.

Each combined iteration of the 4th and 5th loops defines a small submatrix of C that is passed to the µ-kernel, which loads it into SIMD registers for the update.
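The five-loop structure described above can be sketched as follows. The blocking parameters `NC`, `KC`, `MC`, `NR`, `MR` and all helper names are illustrative assumptions, not the library's actual values or identifiers. To keep the sketch short, the packed buffers are plain row-major copies and the dimensions are assumed to be multiples of the blocking parameters; the real implementation reshuffles the buffers into panels and handles edge cases.

```java
// Hedged sketch of the BLIS-like five-loop structure for C += A * B, with
// row-major m×k A, k×n B, m×n C. A scalar microKernel stands in for the
// platform-specific Vector API µ-kernel.
public final class FiveLoops {
    static final int NC = 8, KC = 8, MC = 8, NR = 2, MR = 2; // illustrative

    public static void gemm(double[] a, double[] b, double[] c, int m, int k, int n) {
        double[] bPack = new double[KC * NC];
        double[] aPack = new double[MC * KC];
        for (int jc = 0; jc < n; jc += NC)                   // loop 1: columns of B, C
            for (int pc = 0; pc < k; pc += KC) {             // loop 2: the k dimension
                packB(b, bPack, pc, jc, n);                  // copy a KC×NC block of B
                for (int ic = 0; ic < m; ic += MC) {         // loop 3: rows of A, C
                    packA(a, aPack, ic, pc, k);              // copy an MC×KC block of A
                    for (int jr = 0; jr < NC; jr += NR)      // loop 4
                        for (int ir = 0; ir < MC; ir += MR)  // loop 5: invoke µ-kernel
                            microKernel(aPack, bPack, c, ir, jr, ic + ir, jc + jr, n);
                }
            }
    }

    static void packB(double[] b, double[] bPack, int pc, int jc, int n) {
        for (int p = 0; p < KC; p++)
            for (int j = 0; j < NC; j++)
                bPack[p * NC + j] = b[(pc + p) * n + jc + j];
    }

    static void packA(double[] a, double[] aPack, int ic, int pc, int k) {
        for (int i = 0; i < MC; i++)
            for (int p = 0; p < KC; p++)
                aPack[i * KC + p] = a[(ic + i) * k + pc + p];
    }

    // Scalar stand-in for the µ-kernel: updates an MR×NR block of C with a
    // KC-long series of rank-1 updates read from the packed buffers.
    static void microKernel(double[] aPack, double[] bPack, double[] c,
                            int ir, int jr, int ci, int cj, int n) {
        for (int p = 0; p < KC; p++)                         // loop 6: rank-1 updates
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)
                    c[(ci + i) * n + cj + j] +=
                        aPack[(ir + i) * KC + p] * bPack[p * NC + jr + j];
    }
}
```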

The µ-kernel performs the update of the C submatrix in the SIMD registers by a series of rank-1 updates running inside (yet another, 6th) loop. It then stores the submatrix present in the SIMD registers back to the matrix C. This small submatrix of C might be updated again in later iterations.
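A scalar model of that inner loop, under assumed names and panel layouts (the real µ-kernels use the Vector API and platform-specific register blocking), might look like this. The local `acc` array stands in for the SIMD registers: the accumulator is updated `kc` times by rank-1 (outer-product) steps and stored back to C only once per µ-kernel invocation.

```java
// Illustrative scalar model of the µ-kernel's rank-1 updates. aPanel holds
// the MR values of A for each p contiguously (aPanel[p*MR + i]); bPanel
// holds the NR values of B for each p contiguously (bPanel[p*NR + j]), so
// both are read with stride 1. c is row-major with row stride ldc.
public final class MicroKernel {
    static final int MR = 4, NR = 4; // illustrative register-block sizes

    public static void update(double[] aPanel, double[] bPanel, int kc,
                              double[] c, int ci, int cj, int ldc) {
        double[] acc = new double[MR * NR];      // stands in for SIMD registers
        for (int p = 0; p < kc; p++)             // the inner (6th) loop
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)     // one rank-1 update per p
                    acc[i * NR + j] += aPanel[p * MR + i] * bPanel[p * NR + j];
        // Store the accumulated MR×NR block back to C, once per invocation.
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                c[(ci + i) * ldc + cj + j] += acc[i * NR + j];
    }
}
```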

Ideally, the faster layers (usually the L1d, L2 and L3 caches) should be filled with matrix data as much as possible, while leaving some headroom for other tasks. The reason is that a matrix multiplication is typically more memory-bound than CPU-bound: data closer to the CPU means fewer stalls due to cache misses.

Cache occupancy is only indirectly controlled, via a handful of blocking parameters, in the hope of achieving a high ratio of floating-point operations to memory operations.
