Performance transformation: Loop reordering in sumfact kernel
The transformation reorders the loop nest of the sum-factorization kernel. There are two ways to handle accumulation when reordering loops in a tensor contraction:
- Accumulate directly in the output variable after setting it to zero
- Accumulate in a sufficiently large temporary
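
As a rough illustration of the two variants, here is a minimal plain-Python sketch for a single contraction `out[i][j] = sum_k A[i][k] * X[k][j]`. The function names, the chosen loop orders, and the use of nested lists are assumptions made for this example only. The actual transformation operates on loopy kernels and is not shown here.

```python
def contract_reference(A, X):
    """Baseline: reduction loop k is innermost, so a scalar accumulator suffices."""
    m, n, p = len(A), len(X), len(X[0])
    out = [[0.0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * X[k][j]
            out[i][j] = acc
    return out


def contract_direct(A, X):
    """Variant 1: zero the output first, then accumulate into it directly.

    The reduction loop k is moved outermost (an assumed loop order).
    """
    m, n, p = len(A), len(X), len(X[0])
    out = [[0.0] * p for _ in range(m)]  # output set to zero up front
    for k in range(n):
        for i in range(m):
            for j in range(p):
                out[i][j] += A[i][k] * X[k][j]
    return out


def contract_temporary(A, X):
    """Variant 2: accumulate in a temporary, then copy into the output.

    With loop order i, k, j the temporary needs one slot per j, since all
    loops inside the reduction loop need their own partial sum.
    """
    m, n, p = len(A), len(X), len(X[0])
    out = [[None] * p for _ in range(m)]  # output is only written, never read
    for i in range(m):
        tmp = [0.0] * p  # "large enough" temporary for the inner j loop
        for k in range(n):
            for j in range(p):
                tmp[j] += A[i][k] * X[k][j]
        for j in range(p):
            out[i][j] = tmp[j]
    return out
```

All three functions compute the same result. They differ only in loop order and in where the partial sums live, which is what the two variants above trade off.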
This merge request implements both variants of loop reordering, along with the option to create an autotune target directly from the loopy kernel.
Edited by René Heß