add openmp pragmas to loopsimd

Nils-Arne Dreier requested to merge loopsimd_add_openmp_pragmas into master

This MR adds OpenMP pragmas to the class LoopSIMD to enforce SIMD vectorization. In particular, this helps the compiler emit FMA instructions for expressions like

a += b*c

See: #223 (closed) and https://stackoverflow.com/questions/64682270/more-aggresive-optimization-for-fma-operations
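As a rough illustration of the idea, the sketch below annotates a lane-wise `a += b*c` loop with an OpenMP SIMD pragma. It assumes the pragma in question is `#pragma omp simd`; `LoopVec` and `addProduct` are hypothetical names made up for this example and are not the actual LoopSIMD implementation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a loop-based SIMD type (not the real LoopSIMD).
template <class T, std::size_t S>
struct LoopVec {
  std::array<T, S> lanes{};

  // Lane-wise update: *this += b * c.
  LoopVec& addProduct(const LoopVec& b, const LoopVec& c) {
    // Asks the compiler to vectorize the loop. Without OpenMP SIMD support
    // enabled (e.g. -fopenmp or -fopenmp-simd) the pragma is ignored and
    // the loop still computes the same result.
    #pragma omp simd
    for (std::size_t i = 0; i < S; ++i)
      lanes[i] += b.lanes[i] * c.lanes[i];
    return *this;
  }
};
```

Whether the multiply and add are actually fused into FMA instructions still depends on the compiler and target flags (such as `-march=native`), so inspecting the generated assembly is the only reliable check.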

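On our Skylake machine this yields a speedup of roughly 2 for matrix-vector multiplication across most SIMD widths, ranging from no change for BM_MatApply<1> to about 3 for BM_MatApply<16>. Compiled with: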

g++-9 -O3 -march=native -fopenmp -mprefer-vector-width=512

without pragmas:

2021-09-24 09:19:03
Running ./simdmatapply
Run on (80 X 2401 MHz CPU s)
CPU Caches:
  L1 Data 32K (x40)
  L1 Instruction 32K (x40)
  L2 Unified 1024K (x40)
  L3 Unified 28160K (x2)
Load Average: 0.48, 16.85, 17.52
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_MatApply<1>        56782 ns        56782 ns        12333
BM_MatApply<8>       212008 ns       212012 ns         3298
BM_MatApply<16>      402430 ns       402427 ns         1737
BM_MatApply<32>      666676 ns       666689 ns         1014
BM_MatApply<48>      936289 ns       936301 ns          722
BM_MatApply<64>     1180108 ns      1180123 ns          572
BM_MatApply<96>     1739868 ns      1739876 ns          389
BM_MatApply<128>    2319212 ns      2319225 ns          293
BM_MatApply<192>    4261231 ns      4261318 ns          120


with pragmas:

2021-09-24 09:18:20
Running ./simdmatapply
Run on (80 X 2401 MHz CPU s)
CPU Caches:
  L1 Data 32K (x40)
  L1 Instruction 32K (x40)
  L2 Unified 1024K (x40)
  L3 Unified 28160K (x2)
Load Average: 0.69, 19.55, 18.38
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_MatApply<1>        56521 ns        56521 ns        12370
BM_MatApply<8>        88028 ns        88030 ns         7921
BM_MatApply<16>      134909 ns       134910 ns         5146
BM_MatApply<32>      283983 ns       283984 ns         2449
BM_MatApply<48>      496748 ns       496753 ns         1304
BM_MatApply<64>      665739 ns       665752 ns          981
BM_MatApply<96>      894130 ns       894148 ns          727
BM_MatApply<128>    1170710 ns      1170611 ns          555
BM_MatApply<192>    3690765 ns      3690801 ns          188
