Add OpenMP pragmas to LoopSIMD
This MR adds OpenMP pragmas to the class LoopSIMD
to enforce SIMD optimization. In particular, this helps the compiler use FMA instructions in expressions like
a += b*c
See: #223 (closed) and https://stackoverflow.com/questions/64682270/more-aggresive-optimization-for-fma-operations
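To illustrate the idea, here is a minimal sketch of a loop-based SIMD type whose lane loops carry an OpenMP SIMD pragma. It is not the actual LoopSIMD code: the names `LoopVec`, `multiplyAdd`, and `demo` are hypothetical, and the use of `#pragma omp simd` is an assumption about which pragma is meant. The pragma only takes effect when OpenMP SIMD support is enabled (for example with `-fopenmp` or `-fopenmp-simd` in GCC), and whether FMA instructions are actually emitted still depends on the compiler, flags, and target.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a loop-based SIMD type: N lanes in a plain array.
template <class T, std::size_t N>
struct LoopVec {
  T lane[N];

  // Lane-wise product; the pragma asks the compiler to vectorize the loop.
  friend LoopVec operator*(const LoopVec& b, const LoopVec& c) {
    LoopVec r{};
    #pragma omp simd
    for (std::size_t i = 0; i < N; ++i)
      r.lane[i] = b.lane[i] * c.lane[i];
    return r;
  }

  // Lane-wise accumulation.
  LoopVec& operator+=(const LoopVec& b) {
    #pragma omp simd
    for (std::size_t i = 0; i < N; ++i)
      lane[i] += b.lane[i];
    return *this;
  }
};

// The expression from the description: a += b*c, evaluated lane by lane.
template <class T, std::size_t N>
void multiplyAdd(LoopVec<T, N>& a, const LoopVec<T, N>& b, const LoopVec<T, N>& c) {
  a += b * c;
}

// Small self-check: every lane starts at 1 and receives 2 * 3.
inline double demo() {
  LoopVec<double, 8> a{}, b{}, c{};
  for (std::size_t i = 0; i < 8; ++i) {
    a.lane[i] = 1.0;
    b.lane[i] = 2.0;
    c.lane[i] = 3.0;
  }
  multiplyAdd(a, b, c);
  for (std::size_t i = 0; i < 8; ++i)
    assert(a.lane[i] == 7.0);
  return a.lane[0];
}
```

Without OpenMP enabled the pragmas are ignored and the code still computes the same result, so the sketch can be compiled with and without `-fopenmp` to compare the generated assembly.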
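On our Skylake machine this yields a speedup of about 1.8 to 3 for matrix-vector multiplication in BM_MatApply<8> through BM_MatApply<128>; BM_MatApply<1> is unchanged and BM_MatApply<192> gains about 15%. The benchmark was compiled with: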
g++-9 -O3 -march=native -fopenmp -mprefer-vector-width=512
without pragmas:
2021-09-24 09:19:03
Running ./simdmatapply
Run on (80 X 2401 MHz CPU s)
CPU Caches:
L1 Data 32K (x40)
L1 Instruction 32K (x40)
L2 Unified 1024K (x40)
L3 Unified 28160K (x2)
Load Average: 0.48, 16.85, 17.52
----------------------------------------------------------
Benchmark                 Time            CPU   Iterations
----------------------------------------------------------
BM_MatApply<1>        56782 ns       56782 ns        12333
BM_MatApply<8>       212008 ns      212012 ns         3298
BM_MatApply<16>      402430 ns      402427 ns         1737
BM_MatApply<32>      666676 ns      666689 ns         1014
BM_MatApply<48>      936289 ns      936301 ns          722
BM_MatApply<64>     1180108 ns     1180123 ns          572
BM_MatApply<96>     1739868 ns     1739876 ns          389
BM_MatApply<128>    2319212 ns     2319225 ns          293
BM_MatApply<192>    4261231 ns     4261318 ns          120
with pragmas:
2021-09-24 09:18:20
Running ./simdmatapply
Run on (80 X 2401 MHz CPU s)
CPU Caches:
L1 Data 32K (x40)
L1 Instruction 32K (x40)
L2 Unified 1024K (x40)
L3 Unified 28160K (x2)
Load Average: 0.69, 19.55, 18.38
----------------------------------------------------------
Benchmark                 Time            CPU   Iterations
----------------------------------------------------------
BM_MatApply<1>        56521 ns       56521 ns        12370
BM_MatApply<8>        88028 ns       88030 ns         7921
BM_MatApply<16>      134909 ns      134910 ns         5146
BM_MatApply<32>      283983 ns      283984 ns         2449
BM_MatApply<48>      496748 ns      496753 ns         1304
BM_MatApply<64>      665739 ns      665752 ns          981
BM_MatApply<96>      894130 ns      894148 ns          727
BM_MatApply<128>    1170710 ns     1170611 ns          555
BM_MatApply<192>    3690765 ns     3690801 ns          188
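
The per-benchmark speedup can be read off the two runs by dividing the "Time" values. The following sketch does this arithmetic; the names `templateArgs`, `withoutPragmas`, `withPragmas`, `speedup`, and `printSpeedups` are introduced here for illustration only, and the numbers are copied from the output above.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

// Template arguments of BM_MatApply, in the order they appear in the output.
constexpr std::array<int, 9> templateArgs = {1, 8, 16, 32, 48, 64, 96, 128, 192};

// "Time" column in ns, without and with the pragmas.
constexpr std::array<double, 9> withoutPragmas = {
    56782, 212008, 402430, 666676, 936289, 1180108, 1739868, 2319212, 4261231};
constexpr std::array<double, 9> withPragmas = {
    56521, 88028, 134909, 283983, 496748, 665739, 894130, 1170710, 3690765};

// Speedup of entry i: time without pragmas divided by time with pragmas.
constexpr double speedup(std::size_t i) {
  return withoutPragmas[i] / withPragmas[i];
}

// Print one line per benchmark entry.
inline void printSpeedups() {
  for (std::size_t i = 0; i < templateArgs.size(); ++i)
    std::printf("BM_MatApply<%d>: %.2fx\n", templateArgs[i], speedup(i));
}
```

The ratios come out at about 1.0 for BM_MatApply<1>, between roughly 1.8 and 3.0 for BM_MatApply<8> through BM_MatApply<128> (largest for BM_MatApply<16>), and about 1.15 for BM_MatApply<192>.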