Author: Michael Wolfe, Compiler Engineer, The Portland Group, Inc.
Source: HPCWire
Matmul is a highly parallel algorithm, but let me emphasize that parallelism does not equate to performance. We need to carefully sculpt our algorithm to match the parallelism available in the architecture in order to reap the benefits. This is true whether we are targeting a GPU, a multicore x64, or even a single core with packed SSE operations. As an example, I took the simple matmul loop (in C, but with the matrices stored column-major):
    for( int j = 0; j < m; ++j )
        for( int k = 0; k < p; ++k )
            for( int i = 0; i < n; ++i )
                a[i+pitch_a*j] += b[i+pitch_b*k] * c[k+pitch_c*j];
modified it several ways and ran it on an Intel Xeon (3GHz, 6MB cache, 16GB memory, Penryn) using 4096×4096 matrices (to compare with results we’ll see below). With the loops in the order shown (stride-1 inner loop), the program ran at 1.7 GFLOPs; this is compiled C performance (using pgcc -fast). We can improve that by tiling, or blocking, the loops: organizing the matmul as a set of submatrix multiplications, sized so that each submatrix matmul fits in the processor cache. This improves performance to 5.7 GFLOPs, and it jumps to over 22 GFLOPs when we use OpenMP directives and run on all four cores. Advanced compilers help by automatically managing vectorization, unrolling, and memory alignment, adding prefetch instructions, and so forth.
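To make the tiling idea concrete, here is a minimal sketch of a blocked version of the loop above with an OpenMP directive on the outermost tile loop. It is not the code used for the measurements quoted here; the function name matmul_tiled, the helper min_long, and the tile parameter are hypothetical names chosen for illustration, and the best tile size depends on the cache of the machine at hand.

```c
#include <assert.h>

/* Hypothetical helper: smaller of two loop bounds. */
static long min_long(long x, long y) { return x < y ? x : y; }

/*
 * Blocked (tiled) matmul sketch: a (n x m) += b (n x p) * c (p x m).
 * All matrices are column-major, with pitch_* as the column stride,
 * matching the indexing of the simple loop in the text.
 * `tile` is an assumed tuning parameter: pick it so that the three
 * tile-sized submatrices touched at once fit in cache.
 */
void matmul_tiled(long n, long m, long p, long tile,
                  double *a, long pitch_a,
                  const double *b, long pitch_b,
                  const double *c, long pitch_c)
{
    assert(tile > 0);
    assert(pitch_a >= n && pitch_b >= n && pitch_c >= p);

    /* Each jj tile owns a distinct set of columns of a, so threads
       never write the same element. Without OpenMP enabled, the
       pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (long jj = 0; jj < m; jj += tile) {
        for (long kk = 0; kk < p; kk += tile) {
            for (long ii = 0; ii < n; ii += tile) {
                long j_end = min_long(jj + tile, m);
                long k_end = min_long(kk + tile, p);
                long i_end = min_long(ii + tile, n);

                /* Submatrix multiply, same j-k-i order as the original,
                   so the inner loop is still stride-1 in a and b. */
                for (long j = jj; j < j_end; ++j) {
                    for (long k = kk; k < k_end; ++k) {
                        double ckj = c[k + pitch_c * j];
                        for (long i = ii; i < i_end; ++i)
                            a[i + pitch_a * j] += b[i + pitch_b * k] * ckj;
                    }
                }
            }
        }
    }
}
```

A call such as matmul_tiled(n, m, p, 64, a, pitch_a, b, pitch_b, c, pitch_c) would be one plausible starting point; the value 64 is only a guess to be tuned by measurement. Parallelizing over column tiles of a is one simple choice that avoids write conflicts, though other decompositions are possible.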


