FMA optimization for LoopSIMD

Compilers fail to apply the FMA optimization to LoopSIMD expressions like

x += alpha*b;

if the length of the LoopSIMD exceeds a compiler-dependent size. You can see this in this Godbolt example. For gcc-10 the critical size is 7, and for clang-11 it is 28. The problem also occurs if one uses other SIMD types, like Vec8d or Vc::Vector<double>, inside LoopSIMD to build a larger SIMD type.
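To make the pattern concrete, here is a minimal sketch with a simplified stand-in for LoopSIMD. ToyLoopSIMD and axpy are hypothetical names used only for illustration; this is not the dune-common implementation. The multiplication fills a temporary in one loop and the addition runs in a second loop, so presumably the compiler has to unroll or fuse both loops before it can contract them into FMA instructions.

```cpp
#include <array>
#include <cstddef>

// Hypothetical, simplified stand-in for LoopSIMD (illustration only).
template <class T, std::size_t N>
struct ToyLoopSIMD {
  std::array<T, N> lanes;

  T& operator[](std::size_t i) { return lanes[i]; }
  const T& operator[](std::size_t i) const { return lanes[i]; }

  // Second loop: adds the already computed temporary lane by lane.
  ToyLoopSIMD& operator+=(const ToyLoopSIMD& other) {
    for (std::size_t i = 0; i < N; ++i) lanes[i] += other[i];
    return *this;
  }
};

// First loop: operator* materializes the product in a temporary.
template <class T, std::size_t N>
ToyLoopSIMD<T, N> operator*(T alpha, const ToyLoopSIMD<T, N>& b) {
  ToyLoopSIMD<T, N> out;
  for (std::size_t i = 0; i < N; ++i) out[i] = alpha * b[i];
  return out;
}

// The expression from above: multiply and add live in separate loops.
template <class T, std::size_t N>
void axpy(ToyLoopSIMD<T, N>& x, T alpha, const ToyLoopSIMD<T, N>& b) {
  x += alpha * b;
}
```

Compiling axpy for different N with optimizations enabled and inspecting the assembly is one way to look for the size at which the FMA instructions disappear.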

I see three possible solutions:

1. Wait for better compilers (I do not know how likely that is).
2. Replace the expressions with a function call like x = fma(alpha,b,x) that interleaves the multiplication and addition. A fallback for arithmetic types is provided by std::fma. That would require changing a lot of code. However, I think the most important places are in densematrix.hh and densevector.hh, as a lot of other performance-relevant things, like operator application and preconditioners in dune-istl, build on them.
3. We could return an expression template from operator* that is evaluated when it is assigned to a LoopSIMD, while operator+= interleaves the multiplication and addition such that the compiler can do the optimization. I already made a first attempt at this: https://gitlab.dune-project.org/nils.dreier/dune-common/-/commits/LoopSIMD_fma_expression_template . But I think it would be difficult to get this past the CI machinery.
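A rough sketch of what the second and third options could look like, again on a hypothetical stand-in type. ToyLoopSIMD and ScaledExpr are made-up names, and the code in the linked branch may look quite different. The point is only that each variant does the multiplication and addition in a single loop.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical, simplified stand-in for LoopSIMD (illustration only).
template <class T, std::size_t N>
struct ToyLoopSIMD {
  std::array<T, N> lanes;
  T& operator[](std::size_t i) { return lanes[i]; }
  const T& operator[](std::size_t i) const { return lanes[i]; }
};

// Option 2: a free fma() that handles multiply and add in one loop.
// Scalar lanes fall back to std::fma.
template <class T, std::size_t N>
ToyLoopSIMD<T, N> fma(T alpha, const ToyLoopSIMD<T, N>& b,
                      const ToyLoopSIMD<T, N>& x) {
  ToyLoopSIMD<T, N> out;
  for (std::size_t i = 0; i < N; ++i) out[i] = std::fma(alpha, b[i], x[i]);
  return out;
}

// Option 3: operator* returns a lightweight proxy instead of a temporary.
// The proxy only refers to b, so it must not outlive the full expression.
template <class T, std::size_t N>
struct ScaledExpr {
  T alpha;
  const ToyLoopSIMD<T, N>& b;
};

template <class T, std::size_t N>
ScaledExpr<T, N> operator*(T alpha, const ToyLoopSIMD<T, N>& b) {
  return {alpha, b};
}

// operator+= evaluates the proxy with multiply and add interleaved.
template <class T, std::size_t N>
ToyLoopSIMD<T, N>& operator+=(ToyLoopSIMD<T, N>& x,
                              const ScaledExpr<T, N>& e) {
  for (std::size_t i = 0; i < N; ++i) x[i] += e.alpha * e.b[i];
  return x;
}
```

With the proxy, existing code such as x += alpha*b keeps compiling unchanged, whereas the fma() variant requires touching every call site. A complete version of the proxy would also need a conversion to the SIMD type so that plain assignments like y = alpha*b still work.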

For an SpMV operation in dune-istl, I can observe a speedup of up to 2x.

@joe I think you have the best overview of the SIMD things. It would be great to hear your opinion.