MTL 4: Performance on an AMD Athlon 2GHz
The first benchmark is the product of a dense matrix with its transpose, i.e. A * trans(A):
This operation is favorable for row-major matrices because both arguments are traversed with stride 1. As a consequence, the C and C++ implementations perform well for large matrices compared with the Fortran implementation, where both arguments are traversed with long strides. The MTL4 implementation is even less affected by the matrix size, thanks to its recursive approach.
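For reference, the benchmarked expression can be written directly in MTL4; the size and initialization below are illustrative assumptions, not the benchmark's actual parameters:

    #include <boost/numeric/mtl/mtl.hpp>

    int main()
    {
        const unsigned n= 1000;            // illustrative size, not the benchmark's
        mtl::dense2D<double> A(n, n), B(n, n);
        A= 2.0;                            // scalar assignment: A becomes 2 * identity
        B= A * trans(A);                   // the measured operation
        return 0;
    }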
The implementation uses tiling on the block level (typically 64 by 64). For the processor considered, a tiling of 2 by 4 yields the best performance, while processors with more available floating-point registers (e.g. PowerPC) are faster with 4 by 4 tiling. The metaprogramming tuning in MTL4 lets the user set these parameters in the type definitions of functors; the unrolled implementation is then generated at compile time.
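The sketch below illustrates this technique under simplifying assumptions (square row-major matrices whose dimension is divisible by the tile sizes); the functor name is hypothetical and not MTL4's actual tuning interface:

    #include <cstddef>

    // Illustration only: tile sizes as compile-time parameters, in the spirit of
    // MTL4's tuning functors; the names here are hypothetical.
    template <std::size_t TileR, std::size_t TileC>
    struct tiled_mult
    {
        // C+= A * B for row-major n x n matrices, n divisible by the tile sizes
        static void apply(const double* A, const double* B, double* C, std::size_t n)
        {
            for (std::size_t i= 0; i < n; i+= TileR)
                for (std::size_t j= 0; j < n; j+= TileC) {
                    double tmp[TileR][TileC]= {};     // accumulators kept in registers
                    for (std::size_t k= 0; k < n; ++k)
                        for (std::size_t ti= 0; ti < TileR; ++ti)      // bounds known at
                            for (std::size_t tj= 0; tj < TileC; ++tj)  // compile time ->
                                tmp[ti][tj]+= A[(i+ti)*n + k] * B[k*n + j+tj]; // unrolled
                    for (std::size_t ti= 0; ti < TileR; ++ti)
                        for (std::size_t tj= 0; tj < TileC; ++tj)
                            C[(i+ti)*n + j+tj]+= tmp[ti][tj];
                }
        }
    };

    // 2 by 4 tiling for the Athlon; a PowerPC might prefer tiled_mult<4, 4>
    typedef tiled_mult<2, 4> athlon_mult;

Changing the typedef regenerates the unrolled kernel at compile time; no inner loop needs to be rewritten by hand.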
In this measurement, the benchmark was compiled without -DMTL_HAS_BLAS (/DMTL_HAS_BLAS on MSVC). Had BLAS been enabled, MTL4 would have dispatched the product to the BLAS routine and the two curves would have been identical.
The second example transposes the first argument of the dense matrix product, i.e. trans(A) * A. This operation is correspondingly favorable for column-major matrices, so the Fortran implementation scales better than the C/C++ codes:
MTL4's performance also decreases with respect to the first benchmark, but the effect is limited thanks to the recursive implementation.
Multiplying matrices of the same orientation without transposition, i.e. A * A, scales poorly for both row-major and column-major layouts if no blocking is used:
As in the previous measurements, the nested blocking of GotoBLAS and the recursive blocking of MTL4 cope with the locality problems of large matrices. In this plot, we also compare with the performance of recursive matrix formats. The trend is similar to the traditional row-major layout, but the performance behaves more stably: while row-major matrices with strides that are large powers of two introduce a fair amount of cache conflicts, the improved locality of the recursive layout minimizes such conflicts.
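Recursive formats address elements along a space-filling curve; the following sketch shows the idea with a Morton (Z-order) index computation, assuming a square matrix with power-of-two dimensions. It illustrates the layout principle only, not MTL4's implementation:

    #include <cstddef>
    #include <cstdint>

    // Morton (Z-order) index: interleave the bits of row and column.
    // Neighboring sub-blocks then stay close in memory, which avoids the
    // power-of-two stride conflicts of plain row-major storage.
    inline std::uint64_t morton_index(std::uint32_t row, std::uint32_t col)
    {
        std::uint64_t z= 0;
        for (unsigned b= 0; b < 32; ++b) {
            z|= std::uint64_t((row >> b) & 1u) << (2*b + 1);
            z|= std::uint64_t((col >> b) & 1u) << (2*b);
        }
        return z;
    }

    // element access in a Morton-ordered array (illustration only)
    inline double& at(double* data, std::uint32_t r, std::uint32_t c)
    {   return data[morton_index(r, c)]; }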
The following benchmark considers a different operation: x= alpha * y + beta * z, with scalars alpha and beta and vectors x, y, z.
Most modern libraries use expression templates for this computation, so that all operations are performed in one single loop.
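A minimal sketch of the technique (generic code, not MTL4's implementation): the right-hand side is encoded as a type, and the assignment evaluates it element-wise in a single loop without temporaries:

    #include <cstddef>
    #include <vector>

    template <typename E>
    struct expr
    {
        const E& self() const { return static_cast<const E&>(*this); }
    };

    struct vec : expr<vec>
    {
        std::vector<double> data;
        explicit vec(std::size_t n) : data(n) {}
        double operator[](std::size_t i) const { return data[i]; }
        std::size_t size() const { return data.size(); }

        template <typename E>
        vec& operator=(const expr<E>& e)      // single fused loop
        {
            for (std::size_t i= 0; i < data.size(); ++i)
                data[i]= e.self()[i];
            return *this;
        }
    };

    template <typename E>
    struct scaled : expr<scaled<E> >
    {
        double a; const E& v;
        scaled(double a, const E& v) : a(a), v(v) {}
        double operator[](std::size_t i) const { return a * v[i]; }
    };

    template <typename E1, typename E2>
    struct sum : expr<sum<E1, E2> >
    {
        const E1& l; const E2& r;
        sum(const E1& l, const E2& r) : l(l), r(r) {}
        double operator[](std::size_t i) const { return l[i] + r[i]; }
    };

    template <typename E>
    scaled<E> operator*(double a, const expr<E>& v) { return scaled<E>(a, v.self()); }

    template <typename E1, typename E2>
    sum<E1, E2> operator+(const expr<E1>& l, const expr<E2>& r)
    { return sum<E1, E2>(l.self(), r.self()); }

    // usage: x= alpha * y + beta * z builds a sum<scaled<vec>, scaled<vec> >
    // and is evaluated in one loop by vec::operator=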
Finally, we managed to outperform GotoBLAS in one function, at least for some sizes:
The dot product in this plot internally uses unrolling with block size 8. Note that the unrolled code is generated by the compiler, not written by hand.
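The following sketch shows how such unrolling can be expressed with the block size as a template parameter; it illustrates the technique only and is not the MTL4 source:

    #include <cstddef>

    // Compile-time unrolled dot product: the block size is a template
    // parameter and the compiler emits the unrolled block.
    template <std::size_t BS>
    double dot_unrolled(const double* v, const double* w, std::size_t n)
    {
        double acc[BS]= {};                // separate accumulators break the
        std::size_t i= 0;                  // dependency chain between products
        for (; i + BS <= n; i+= BS)
            for (std::size_t b= 0; b < BS; ++b)  // bound known at compile time,
                acc[b]+= v[i+b] * w[i+b];        // so the compiler unrolls it
        double result= 0.0;
        for (; i < n; ++i)                 // remainder loop
            result+= v[i] * w[i];
        for (std::size_t b= 0; b < BS; ++b)
            result+= acc[b];
        return result;
    }

    // usage, e.g. with block size 8 as in the benchmark above:
    // double d= dot_unrolled<8>(v, w, n);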
Thanks to Laurent Plagne for his support with the BTL and to Chris Cole for running the programs on a cluster node at Indiana University.