Improve the performance of layout mapping by explicit loop
Summary
The performance of tensor access depends on the implementation of the corresponding layout mapping. In my tests it was faster to compute the mapping with an explicit loop than with an inner product of indices and strides. This needs to be investigated further. The libc++ implementation shipped with clang-18 uses a fold expression over the comma operator, which is essentially equivalent to the loop approach but completely unrolled. This was slightly slower in my test, but I only tested that with gcc-12. With clang, the FieldTensor ended up slower than a DynamicTensor when the rank is large; with gcc it is the other way around.
Timings
Example: bench_ten_ten
with N=4
Clang 18 (libc++)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 4577 ns | 4576 ns | 134356 |
BM_DynamicTensor_DynamicTensor | 2634 ns | 2634 ns | 265871 |
Clang 18 (`index(i)*stride(i) +...+ 0`)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 4514 ns | 4514 ns | 135101 |
BM_DynamicTensor_DynamicTensor | 3159 ns | 3159 ns | 222465 |
Clang 18 (for-loop - this MR)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 4524 ns | 4523 ns | 135125 |
BM_DynamicTensor_DynamicTensor | 2608 ns | 2608 ns | 267791 |
GCC-12 (`index(i)*stride(i) +...+ 0`)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 1172 ns | 1172 ns | 588238 |
BM_DynamicTensor_DynamicTensor | 5996 ns | 5995 ns | 115243 |
GCC-12 (for-loop - this MR)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 1140 ns | 1140 ns | 603939 |
BM_DynamicTensor_DynamicTensor | 2931 ns | 2931 ns | 239035 |
Edited by Simon Praetorius