FMA optimization for LoopSIMD
Compilers fail to apply the FMA optimization to `LoopSIMD` expressions like `x += alpha*b;` if the length of the `LoopSIMD` exceeds a compiler-dependent size. You can see it in this Godbolt example. For gcc-10 the critical size is 7 and for clang-11 it is 28.
The problem also occurs if one uses other SIMD types, like `LoopSIMD`, to build a larger SIMD type.
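To make the pattern concrete, here is a minimal C++ sketch. `ToyLoop`, `axpy_two_loops` and `axpy_one_loop` are hypothetical names for illustration and not the actual `LoopSIMD` implementation. The sketch assumes that the difficulty comes from the multiplication and the addition living in two separate loops, so the compiler has to fuse them before it can emit FMA instructions.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a loop-based SIMD type; NOT Dune's LoopSIMD.
template <class T, std::size_t S>
struct ToyLoop {
  std::array<T, S> v{};

  // Lane-wise addition in its own loop.
  ToyLoop& operator+=(const ToyLoop& o) {
    for (std::size_t i = 0; i < S; ++i) v[i] += o.v[i];
    return *this;
  }
};

// Lane-wise multiplication in its own loop, returning a temporary.
template <class T, std::size_t S>
ToyLoop<T, S> operator*(const ToyLoop<T, S>& a, const ToyLoop<T, S>& b) {
  ToyLoop<T, S> r;
  for (std::size_t i = 0; i < S; ++i) r.v[i] = a.v[i] * b.v[i];
  return r;
}

// The pattern from the issue: a multiply loop followed by an add loop.
template <class T, std::size_t S>
void axpy_two_loops(ToyLoop<T, S>& x, const ToyLoop<T, S>& alpha,
                    const ToyLoop<T, S>& b) {
  x += alpha * b;
}

// Hand-fused variant: multiply and add sit in the same loop iteration.
template <class T, std::size_t S>
void axpy_one_loop(ToyLoop<T, S>& x, const ToyLoop<T, S>& alpha,
                   const ToyLoop<T, S>& b) {
  for (std::size_t i = 0; i < S; ++i) x.v[i] += alpha.v[i] * b.v[i];
}
```

Comparing the generated assembly of both functions for different values of `S` (with optimization and FMA enabled, for example `-O3 -mfma`) should show whether a given compiler still contracts the two-loop version.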
I see three possible solutions:
1. Wait for better compilers (I do not know how likely that is).
2. Replace the expressions with a function call like `x = fma(alpha,b,x)` that does the multiplication and addition interleaved. A fallback for arithmetic types is given by `std::fma`. That would require changing a lot of code. However, I think the most important places are in `densevector.hh`, as a lot of other performance-relevant things, like operator application and preconditioners in dune-istl, build on it.
3. We could return an expression template from `operator*` that is evaluated when it is assigned to a `LoopSIMD`; in `operator+=` it interleaves the multiplication and addition such that the compiler can do the optimization. I already made a first attempt at this: https://gitlab.dune-project.org/nils.dreier/dune-common/-/commits/LoopSIMD_fma_expression_template . But I think it would be difficult to get this past the CI machinery.
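As a rough illustration of the second and third option, the following C++ sketch shows a lane-wise `fma` with a `std::fma` fallback and a tiny expression template for `operator*`. Everything in the `toy` namespace (`Loop`, `MulExpr`, `fma`) is a hypothetical name made up for this sketch. It is not the code from the linked branch and not the dune-common API.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <type_traits>

namespace toy {  // hypothetical namespace, not part of dune-common

template <class T, std::size_t S>
struct Loop;

// Option 3: a lazy product that only remembers its operands.
template <class T, std::size_t S>
struct MulExpr {
  const Loop<T, S>& a;
  const Loop<T, S>& b;
};

// Hypothetical stand-in for a loop-based SIMD type.
template <class T, std::size_t S>
struct Loop {
  std::array<T, S> v{};

  Loop() = default;

  // The lazy product is evaluated when it is converted to a Loop.
  Loop(const MulExpr<T, S>& e) {
    for (std::size_t i = 0; i < S; ++i) v[i] = e.a.v[i] * e.b.v[i];
  }

  // Multiply and add in the same loop iteration.
  Loop& operator+=(const MulExpr<T, S>& e) {
    for (std::size_t i = 0; i < S; ++i) v[i] += e.a.v[i] * e.b.v[i];
    return *this;
  }
};

// operator* returns the expression node instead of a computed Loop.
template <class T, std::size_t S>
MulExpr<T, S> operator*(const Loop<T, S>& a, const Loop<T, S>& b) {
  return {a, b};
}

// Option 2: scalar fallback that forwards to std::fma.
template <class T, std::enable_if_t<std::is_arithmetic<T>::value, int> = 0>
T fma(T a, T b, T c) {
  return static_cast<T>(std::fma(a, b, c));
}

// Option 2: lane-wise overload, returns a*b + c.
template <class T, std::size_t S>
Loop<T, S> fma(const Loop<T, S>& a, const Loop<T, S>& b, Loop<T, S> c) {
  for (std::size_t i = 0; i < S; ++i) c.v[i] = fma(a.v[i], b.v[i], c.v[i]);
  return c;
}

}  // namespace toy
```

With these definitions, `x += alpha * b;` and `x = toy::fma(alpha, b, x);` both run a single loop over the lanes. One known caveat of the expression-template variant is that `MulExpr` stores references, so something like `auto e = a * b;` must not outlive its operands.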
For a SpMV operation in dune-istl, I can observe a speedup up to
@joe I think you have the best overview of the SIMD things. It would be great to hear your opinion.