FMA optimization for LoopSIMD
The FMA optimization of compilers fails for LoopSIMD expressions like
x += alpha*b;
if the length of the LoopSIMD exceeds a compiler-dependent size. You can see it in this Godbolt example. For gcc-10 the critical size is 7 and for clang-11 it is 28.
The problem also occurs if one uses other SIMD types like Vec8d or Vc::Vector<double> in LoopSIMD to build a larger SIMD type.
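To make the pattern concrete, here is a minimal sketch with a toy stand-in for a loop-based SIMD type. ToyLoopSIMD, axpy_two_loops and axpy_one_loop are hypothetical names for illustration only; this is not the actual Dune::LoopSIMD implementation. It assumes the operators are plain lane-wise loops, so that x += alpha*b first fills a temporary and then adds it in a second loop. Whether a compiler fuses the two loops into FMA instructions is exactly what depends on the size; the hand-fused version is what one would like it to produce.

```cpp
#include <array>
#include <cstddef>

// Toy stand-in for a loop-based SIMD type (hypothetical, NOT Dune::LoopSIMD).
template <class T, std::size_t S>
struct ToyLoopSIMD : std::array<T, S> {
  // Lane-wise addition in its own loop.
  ToyLoopSIMD& operator+=(const ToyLoopSIMD& other) {
    for (std::size_t i = 0; i < S; ++i) (*this)[i] += other[i];
    return *this;
  }
};

// scalar * vector: all products are written to a temporary first.
template <class T, std::size_t S>
ToyLoopSIMD<T, S> operator*(T alpha, const ToyLoopSIMD<T, S>& b) {
  ToyLoopSIMD<T, S> tmp{};
  for (std::size_t i = 0; i < S; ++i) tmp[i] = alpha * b[i];
  return tmp;
}

// The form from the issue: one loop for the multiply, a second for the add.
template <class T, std::size_t S>
void axpy_two_loops(ToyLoopSIMD<T, S>& x, T alpha, const ToyLoopSIMD<T, S>& b) {
  x += alpha * b;
}

// Hand-fused form: multiply and add sit in the same loop body,
// which is the shape a compiler can contract into an FMA.
template <class T, std::size_t S>
void axpy_one_loop(ToyLoopSIMD<T, S>& x, T alpha, const ToyLoopSIMD<T, S>& b) {
  for (std::size_t i = 0; i < S; ++i) x[i] += alpha * b[i];
}
```

Comparing the generated assembly of the two functions for different S (with FMA enabled, e.g. via -march=native) should show the difference described above.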
I see three possible solutions:
- Wait for better compilers (I do not know how likely that is).
- Replace the expressions with a function call like x = fma(alpha,b,x) that interleaves the multiplication and addition. A fallback for arithmetic types is given by std::fma. That would require changing a lot of code. However, I think the most important places are in densematrix.hh and densevector.hh, as a lot of other performance-relevant things, like operator application and preconditioners in dune-istl, build on them.
- We could return an expression template from operator* that is evaluated when it is assigned to a LoopSIMD, and that in operator+= interleaves the multiplication and addition such that the compiler can do the optimization. I already made a first attempt at this: https://gitlab.dune-project.org/nils.dreier/dune-common/-/commits/LoopSIMD_fma_expression_template . But I think it would be difficult to convince the CI machinery of that.
For a SpMV operation in dune-istl, I can observe a speedup of up to 2x.
@joe I think you have the best overview of the SIMD things. It would be great to hear your opinion.