FMA optimization for LoopSIMD
The compilers' FMA optimization fails for LoopSIMD expressions like
x += alpha*b;
if the length of the LoopSIMD exceeds a compiler-dependent size. You can see this in this Godbolt example. For gcc-10 the critical size is 7, and for clang-11 it is 28.
The problem also occurs if one uses other SIMD types like Vec8d or Vc::Vector<double> in LoopSIMD to build a larger SIMD type.
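To make the pattern concrete, here is a minimal sketch of the structure the compiler has to see through. Loop, lane and axpy are hypothetical names for illustration; this is not the actual LoopSIMD implementation, only a stand-in that assumes the operators are written as plain loops over the lanes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for LoopSIMD: S lanes of type T, operators written as lane loops.
template <class T, std::size_t S>
struct Loop {
  std::array<T, S> lane{};

  // Second loop: adds an already computed right-hand side lane by lane.
  Loop& operator+=(const Loop& y) {
    for (std::size_t i = 0; i < S; ++i) lane[i] += y.lane[i];
    return *this;
  }
};

// First loop: stores all products in a temporary before any addition happens.
template <class T, std::size_t S>
Loop<T, S> operator*(const Loop<T, S>& a, const Loop<T, S>& b) {
  Loop<T, S> r;
  for (std::size_t i = 0; i < S; ++i) r.lane[i] = a.lane[i] * b.lane[i];
  return r;
}

// The expression from the issue. An FMA per lane is only possible if the
// compiler merges the two loops above and removes the temporary.
template <class T, std::size_t S>
void axpy(Loop<T, S>& x, const Loop<T, S>& alpha, const Loop<T, S>& b) {
  x += alpha * b;
}

// Usage: with x = 1, alpha = 2 and b = i, lane i ends up as 1 + 2*i.
inline void example() {
  Loop<double, 8> x, alpha, b;
  for (std::size_t i = 0; i < 8; ++i) {
    x.lane[i] = 1.0;
    alpha.lane[i] = 2.0;
    b.lane[i] = static_cast<double>(i);
  }
  axpy(x, alpha, b);
  for (std::size_t i = 0; i < 8; ++i)
    assert(x.lane[i] == 1.0 + 2.0 * static_cast<double>(i));
}
```

Whether the two loops actually get merged and contracted into FMA instructions depends on the compiler, the target flags and the size S, which matches the observation above.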
I see three possible solutions:
- Wait for better compilers (I do not know how likely that is).
- Replace such expressions with a function call like x = fma(alpha,b,x) that interleaves the multiplication and the addition. A fallback for arithmetic types is given by std::fma. That would require changing a lot of code. However, I think the most important places are in densematrix.hh and densevector.hh, as a lot of other performance-relevant things, like operator application and preconditioners in dune-istl, build upon them.
- We could return an expression template from operator* that is evaluated when it is assigned to a LoopSIMD; operator+= would then interleave the multiplication and addition such that the compiler can do the optimization. I already made a first attempt at this: https://gitlab.dune-project.org/nils.dreier/dune-common/-/commits/LoopSIMD_fma_expression_template . But I think it would be difficult to convince the CI machinery to accept it.
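The second and third options could look roughly like the following sketch. Everything in the sketch namespace (Loop, Product, the fma overload) is hypothetical and only illustrates the idea; it is not the code from the linked branch, and it assumes scalar lanes so that std::fma can be called on them directly.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

namespace sketch {

// Hypothetical stand-in for LoopSIMD (not the dune-common class).
template <class T, std::size_t S>
struct Loop {
  std::array<T, S> lane{};
};

// Option 2: one loop that multiplies and adds per lane via std::fma.
template <class T, std::size_t S>
Loop<T, S> fma(const Loop<T, S>& a, const Loop<T, S>& b, const Loop<T, S>& c) {
  Loop<T, S> r;
  for (std::size_t i = 0; i < S; ++i)
    r.lane[i] = std::fma(a.lane[i], b.lane[i], c.lane[i]);
  return r;
}

// Option 3: operator* only records its operands in a lightweight expression.
template <class T, std::size_t S>
struct Product {
  const Loop<T, S>& a;
  const Loop<T, S>& b;

  // Evaluated only when the product is assigned or converted to a Loop.
  operator Loop<T, S>() const {
    Loop<T, S> r;
    for (std::size_t i = 0; i < S; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
  }
};

template <class T, std::size_t S>
Product<T, S> operator*(const Loop<T, S>& a, const Loop<T, S>& b) {
  return {a, b};
}

// x += a*b as a single loop, so multiply and add sit next to each other per lane.
template <class T, std::size_t S>
Loop<T, S>& operator+=(Loop<T, S>& x, const Product<T, S>& p) {
  for (std::size_t i = 0; i < S; ++i) x.lane[i] += p.a.lane[i] * p.b.lane[i];
  return x;
}

// Usage: both variants give 1 + 2*i in lane i for x = 1, alpha = 2, b = i.
inline void example() {
  Loop<double, 4> x, alpha, b;
  for (std::size_t i = 0; i < 4; ++i) {
    x.lane[i] = 1.0;
    alpha.lane[i] = 2.0;
    b.lane[i] = static_cast<double>(i);
  }
  Loop<double, 4> y = fma(alpha, b, x);  // option 2
  x += alpha * b;                        // option 3, call site unchanged
  for (std::size_t i = 0; i < 4; ++i) assert(x.lane[i] == y.lane[i]);
}

}  // namespace sketch
```

The trade-off is visible in example(): the fma variant needs every call site to change, while the expression template keeps x += alpha*b as written but adds a proxy type that holds references to its operands. Note that std::fma typically becomes a library call when the target has no hardware FMA enabled, so the fallback is only cheap with suitable architecture flags.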
For an SpMV operation in dune-istl, I observe a speedup of up to 2x.
@joe I think you have the best overview of the SIMD things. It would be great to hear your opinion.