Improve the performance of layout mapping by explicit loop

Simon Praetorius requested to merge feature/performance-layouts into main

Summary

The performance of tensor access depends on how the corresponding layout mapping is implemented. In my tests, computing the mapping with an explicit loop was faster than computing it as an inner product of indices and strides. This needs to be investigated further. The libc++ implementation shipped with clang-18 uses a fold expression over the comma operator, which is essentially equivalent to the loop approach but completely unrolled. This was slightly slower in my test, but I only tested it with gcc-12. With clang, the FieldTensor ended up slower than a DynamicTensor when the rank is large; with gcc it is the other way around.
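To make the three variants concrete, here is a minimal sketch of each for a strided mapping of fixed rank. `StridedMapping`, `fold_sum`, `fold_comma`, `loop` and `check_variants_agree` are hypothetical names chosen for illustration; this is neither the code in this MR nor the libc++ source, only the shape of the computations described above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for a strided layout mapping of fixed rank N.
template <std::size_t N>
struct StridedMapping {
  std::array<std::size_t, N> strides;

  // Variant 1: fold expression index(i)*stride(i) + ... + 0 (inner product).
  template <std::size_t... I>
  constexpr std::size_t fold_sum(const std::array<std::size_t, N>& idx,
                                 std::index_sequence<I...>) const {
    return ((idx[I] * strides[I]) + ... + std::size_t{0});
  }

  // Variant 2: fold over the comma operator; each term is accumulated
  // into a local variable, which amounts to a fully unrolled loop.
  template <std::size_t... I>
  constexpr std::size_t fold_comma(const std::array<std::size_t, N>& idx,
                                   std::index_sequence<I...>) const {
    std::size_t offset = 0;
    ((offset += idx[I] * strides[I]), ...);
    return offset;
  }

  // Variant 3: explicit loop; unrolling is left to the compiler.
  constexpr std::size_t loop(const std::array<std::size_t, N>& idx) const {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < N; ++i)
      offset += idx[i] * strides[i];
    return offset;
  }
};

// Usage: all three variants agree for a row-major 2x3x4 layout.
inline void check_variants_agree() {
  constexpr StridedMapping<3> m{{12, 4, 1}};
  constexpr std::array<std::size_t, 3> idx{1, 2, 3};
  constexpr auto seq = std::make_index_sequence<3>{};
  assert(m.loop(idx) == m.fold_sum(idx, seq));
  assert(m.loop(idx) == m.fold_comma(idx, seq));
}
```

All three compute the same offset, so any timing difference comes from how well the compiler optimizes each form, which is consistent with the results differing between gcc and clang.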

Timings

Example: bench_ten_ten with N=4

Clang 18 (libc++)

Benchmark                        Time      CPU       Iterations
BM_FieldTensor_FieldTensor       4577 ns   4576 ns   134356
BM_DynamicTensor_DynamicTensor   2634 ns   2634 ns   265871

Clang 18 (index(i)*stride(i) +...+ 0)

Benchmark                        Time      CPU       Iterations
BM_FieldTensor_FieldTensor       4514 ns   4514 ns   135101
BM_DynamicTensor_DynamicTensor   3159 ns   3159 ns   222465

Clang 18 (for-loop - this MR)

Benchmark                        Time      CPU       Iterations
BM_FieldTensor_FieldTensor       4524 ns   4523 ns   135125
BM_DynamicTensor_DynamicTensor   2608 ns   2608 ns   267791

GCC-12 (index(i)*stride(i) +...+ 0)

Benchmark                        Time      CPU       Iterations
BM_FieldTensor_FieldTensor       1172 ns   1172 ns   588238
BM_DynamicTensor_DynamicTensor   5996 ns   5995 ns   115243

GCC-12 (for-loop - this MR)

Benchmark                        Time      CPU       Iterations
BM_FieldTensor_FieldTensor       1140 ns   1140 ns   603939
BM_DynamicTensor_DynamicTensor   2931 ns   2931 ns   239035
Edited by Simon Praetorius