Improve the performance of layout mapping by explicit loop
Summary
The performance of tensor access depends on the implementation of the corresponding layout mapping. In my tests it was faster to compute the mapping with an explicit loop than with an inner product of indices and strides. This needs to be investigated further. The libc++ implementation shipped with clang-18 uses a fold expression over the comma operator, which is essentially equivalent to the loop approach but completely unrolled. This was slightly slower in my test, but I only tested that with gcc-12. With clang, the FieldTensor ended up slower than a DynamicTensor when the rank is large; with gcc it is the other way around.
Timings
Example: bench_ten_ten
with N=4
Clang 18 (libc++)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 4577 ns | 4576 ns | 134356 |
BM_DynamicTensor_DynamicTensor | 2634 ns | 2634 ns | 265871 |
Clang 18 (`index(i)*stride(i) +...+ 0`)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 4514 ns | 4514 ns | 135101 |
BM_DynamicTensor_DynamicTensor | 3159 ns | 3159 ns | 222465 |
Clang 18 (for-loop - this MR)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 4524 ns | 4523 ns | 135125 |
BM_DynamicTensor_DynamicTensor | 2608 ns | 2608 ns | 267791 |
GCC-12 (`index(i)*stride(i) +...+ 0`)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 1172 ns | 1172 ns | 588238 |
BM_DynamicTensor_DynamicTensor | 5996 ns | 5995 ns | 115243 |
GCC-12 (for-loop - this MR)
Benchmark | Time | CPU | Iterations |
---|---|---|---|
BM_FieldTensor_FieldTensor | 1140 ns | 1140 ns | 603939 |
BM_DynamicTensor_DynamicTensor | 2931 ns | 2931 ns | 239035 |
Edited by Simon Praetorius