rgiulietti/la-blis
A BLIS-like implementation of GEneral-Matrix-Matrix multiplication (GEMM)

How to build, run the tests and install

There are currently two platform-specific µ-kernels. Depending on whether you are on Apple silicon or on Intel, adapt the import clauses in com.oracle.la.s.GEMM.java and com.oracle.la.d.GEMM.java accordingly.

Then run `mvn clean install` to build, run the tests, and install the artifact into your local Maven repository (usually `${HOME}/.m2/repository`).

Overview of the BLIS-like implementation

The implementation is based on the discussion in the paper Anatomy of High-Performance Many-Threaded Matrix Multiplication. It performs an update C += A B, where A, B, C are given matrices. Contrary to the title of the paper, however, this implementation is single-threaded.

In essence, the matrices are subdivided into smaller submatrices to perform a block matrix multiplication. The subdivision aims to keep the smaller submatrices in the faster layers of the memory hierarchy.
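To illustrate the idea, here is a minimal, hedged sketch of block matrix multiplication in plain Java. The class name, the single `BLOCK` parameter, and the row-major layout are assumptions for illustration only; the actual implementation uses several distinct blocking parameters and packed buffers.

```java
// Hypothetical sketch of block matrix multiplication C += A * B, where the
// m×k matrix A, the k×n matrix B and the m×n matrix C are stored row-major
// in flat arrays. BLOCK is an illustrative stand-in for the real cache
// blocking parameters, which are chosen per cache level in practice.
public final class BlockedGemm {
    static final int BLOCK = 4; // illustrative block size, not the library's value

    public static void gemm(double[] a, double[] b, double[] c, int m, int k, int n) {
        for (int i0 = 0; i0 < m; i0 += BLOCK)
            for (int p0 = 0; p0 < k; p0 += BLOCK)
                for (int j0 = 0; j0 < n; j0 += BLOCK)
                    // Multiply one pair of submatrices; in a cache-aware
                    // implementation each such block fits in a fast cache level.
                    for (int i = i0; i < Math.min(i0 + BLOCK, m); i++)
                        for (int p = p0; p < Math.min(p0 + BLOCK, k); p++)
                            for (int j = j0; j < Math.min(j0 + BLOCK, n); j++)
                                c[i * n + j] += a[i * k + p] * b[p * n + j];
    }
}
```

The result is the same as a plain triple loop; only the traversal order changes, so each submatrix is reused while it is still cache-resident.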

To this end, submatrices of the inputs are copied into buffers while reshuffling their contents, so that the µ-kernel enjoys contiguous, 1-strided access.
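The reshuffling step can be sketched as follows. The class and method names, the `NR` panel width, and the exact panel layout are assumptions, not the library's identifiers; the point is only that after packing, the µ-kernel reads `NR` consecutive values per step of the k dimension, i.e. with stride 1.

```java
// Illustrative packing of a kc×nc block of a row-major matrix B into NR-wide
// panels. For panel jp, the NR values of row p end up contiguous, so the
// µ-kernel walks the buffer with stride 1. Assumes nc is a multiple of NR.
public final class PackB {
    static final int NR = 4; // illustrative µ-kernel width

    public static double[] pack(double[] b, int ldb, int pc, int jc, int kc, int nc) {
        double[] buf = new double[kc * nc];
        int q = 0;
        for (int jp = 0; jp < nc; jp += NR)      // one panel per NR columns
            for (int p = 0; p < kc; p++)         // walk down the k dimension
                for (int j = 0; j < NR; j++)     // NR contiguous values per row
                    buf[q++] = b[(pc + p) * ldb + jc + jp + j];
        return buf;
    }
}
```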

There are 5 nested loops to achieve the desired subdivision, copying and reshuffling. They don't directly perform any matrix computation, but the innermost 5th loop repeatedly invokes a platform-specific µ-kernel that does the real work. The µ-kernel makes use of the Vector API to exploit data parallelism for the computations.

Each combined iteration of the outermost 1st and 2nd loops defines the submatrix of B that is copied into a buffer and reshuffled for contiguous access in the µ-kernel. Similarly, each iteration of the 3rd loop defines the submatrix of A that is copied and reshuffled for optimal access in the µ-kernel.

Each combined iteration of the 4th and 5th loops defines a small submatrix of C that is passed to the µ-kernel, which loads it into SIMD registers for the update.
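The five-loop structure described above can be sketched as follows. The blocking parameters `NC`, `KC`, `MC`, `NR`, `MR` and all helper names are illustrative assumptions, not the library's actual values or identifiers. To keep the sketch short, the packed buffers are plain row-major copies and the dimensions are assumed to be multiples of the blocking parameters; the real implementation reshuffles the buffers into panels and handles edge cases.

```java
// Hedged sketch of the BLIS-like five-loop structure for C += A * B, with
// row-major m×k A, k×n B, m×n C. A scalar microKernel stands in for the
// platform-specific Vector API µ-kernel.
public final class FiveLoops {
    static final int NC = 8, KC = 8, MC = 8, NR = 2, MR = 2; // illustrative

    public static void gemm(double[] a, double[] b, double[] c, int m, int k, int n) {
        double[] bPack = new double[KC * NC];
        double[] aPack = new double[MC * KC];
        for (int jc = 0; jc < n; jc += NC)                   // loop 1: columns of B, C
            for (int pc = 0; pc < k; pc += KC) {             // loop 2: the k dimension
                packB(b, bPack, pc, jc, n);                  // copy a KC×NC block of B
                for (int ic = 0; ic < m; ic += MC) {         // loop 3: rows of A, C
                    packA(a, aPack, ic, pc, k);              // copy an MC×KC block of A
                    for (int jr = 0; jr < NC; jr += NR)      // loop 4
                        for (int ir = 0; ir < MC; ir += MR)  // loop 5: invoke µ-kernel
                            microKernel(aPack, bPack, c, ir, jr, ic + ir, jc + jr, n);
                }
            }
    }

    static void packB(double[] b, double[] bPack, int pc, int jc, int n) {
        for (int p = 0; p < KC; p++)
            for (int j = 0; j < NC; j++)
                bPack[p * NC + j] = b[(pc + p) * n + jc + j];
    }

    static void packA(double[] a, double[] aPack, int ic, int pc, int k) {
        for (int i = 0; i < MC; i++)
            for (int p = 0; p < KC; p++)
                aPack[i * KC + p] = a[(ic + i) * k + pc + p];
    }

    // Scalar stand-in for the µ-kernel: updates an MR×NR block of C with a
    // KC-long series of rank-1 updates read from the packed buffers.
    static void microKernel(double[] aPack, double[] bPack, double[] c,
                            int ir, int jr, int ci, int cj, int n) {
        for (int p = 0; p < KC; p++)                         // loop 6: rank-1 updates
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)
                    c[(ci + i) * n + cj + j] +=
                        aPack[(ir + i) * KC + p] * bPack[p * NC + jr + j];
    }
}
```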

The µ-kernel performs the update of the C submatrix in the SIMD registers by a series of rank-1 updates running inside (yet another, 6th) loop. It then stores the submatrix present in the SIMD registers back to the matrix C. This small submatrix of C might be updated again in later iterations.
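A scalar model of that inner loop, under assumed names and panel layouts (the real µ-kernels use the Vector API and platform-specific register blocking), might look like this. The local `acc` array stands in for the SIMD registers: the accumulator is updated `kc` times by rank-1 (outer-product) steps and stored back to C only once per µ-kernel invocation.

```java
// Illustrative scalar model of the µ-kernel's rank-1 updates. aPanel holds
// the MR values of A for each p contiguously (aPanel[p*MR + i]); bPanel
// holds the NR values of B for each p contiguously (bPanel[p*NR + j]), so
// both are read with stride 1. c is row-major with row stride ldc.
public final class MicroKernel {
    static final int MR = 4, NR = 4; // illustrative register-block sizes

    public static void update(double[] aPanel, double[] bPanel, int kc,
                              double[] c, int ci, int cj, int ldc) {
        double[] acc = new double[MR * NR];      // stands in for SIMD registers
        for (int p = 0; p < kc; p++)             // the inner (6th) loop
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)     // one rank-1 update per p
                    acc[i * NR + j] += aPanel[p * MR + i] * bPanel[p * NR + j];
        // Store the accumulated MR×NR block back to C, once per invocation.
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                c[(ci + i) * ldc + cj + j] += acc[i * NR + j];
    }
}
```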

Ideally, the faster layers (usually the L1d, L2 and L3 caches) should be filled with matrix data as much as possible, while leaving some headroom for other tasks. The reason is that a matrix multiplication is typically more memory-bound than CPU-bound: data closer to the CPU means fewer stalls due to cache misses.

Cache occupancy is only indirectly controlled, via a handful of blocking parameters, in the hope of achieving a high ratio of floating-point operations to memory operations.
