[draft][threading] Use parallel loops/allocation in BCRSMatrix

Carsten Gräser requested to merge feature/parallel-bcrs-allocation into master

This patch uses parallel loops when initializing the large arrays of the CRS storage. When using multiple threads in assembly, the serial construction of the matrix takes a significant fraction of the total time and does not scale.

It turns out that most of this time is spent in allocation, which appears to be serial. However, this is not entirely true: a plain allocation only makes the OS hand address space to the application, without actually associating memory pages. The latter happens when the allocated memory is first accessed. Indeed, detailed measurements reveal that the actual allocator call is very fast, while the first initialization of the storage is costly. In particular, it is significantly more costly than performing exactly the same access a second time.
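
For reference, a minimal standalone sketch (not part of the patch) that illustrates this effect: the allocator call itself returns quickly, the first pass over the memory is expensive because it maps the pages, and a second pass over the same memory is much cheaper. The buffer size is a placeholder.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <memory>

int main()
{
  const std::size_t n = std::size_t(1) << 27; // ~1 GiB of doubles, adjust to available memory
  using Clock = std::chrono::steady_clock;

  auto t0 = Clock::now();
  std::unique_ptr<double[]> data(new double[n]);    // allocation: address space only
  auto t1 = Clock::now();
  for (std::size_t i = 0; i < n; ++i) data[i] = 0;  // first touch: pages get mapped
  auto t2 = Clock::now();
  for (std::size_t i = 0; i < n; ++i) data[i] = 1;  // second touch: pages already mapped
  auto t3 = Clock::now();

  auto ms = [](auto d) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
  };
  std::cout << "allocate:     " << ms(t1 - t0) << " ms\n"
            << "first touch:  " << ms(t2 - t1) << " ms\n"
            << "second touch: " << ms(t3 - t2) << " ms\n"
            << data[n / 2] << "\n"; // keep the writes observable
}
```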

The current patch uses a simple std::for_each(std::execution::par, ...) for the first initialization. Measurements indicate that this effectively parallelizes the expensive part of the allocation.
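
For illustration, a minimal sketch of the pattern (allocateAndInitialize is a hypothetical helper, not the actual BCRSMatrix code; it assumes a trivially default-constructible entry type and a standard library with parallel algorithm support):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <memory>

// Hypothetical helper sketching the parallel first-touch initialization.
template<class T>
std::unique_ptr<T[]> allocateAndInitialize(std::size_t size)
{
  // Plain allocation: the OS only provides address space, no pages are mapped yet.
  std::unique_ptr<T[]> data(new T[size]);

  // Parallel first touch: each thread value-initializes its own chunk, so the
  // expensive page mapping is distributed over the threads.
  std::for_each(std::execution::par, data.get(), data.get() + size,
                [](T& entry) { entry = T(); });

  return data;
}
```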

E.g. when assembling the matrix for a lowest-order Taylor-Hood discretization of the Stokes problem with 4 threads, this reduces the total assembly time (assemble pattern + create BCRSMatrix + assemble values) by about 20%, because it shrinks the serial part.

This is marked as draft because:

  • So far we do not use parallelism in dune-istl. This may need to be discussed. In several places this is not trivial, but here it seems to be a free lunch.
  • There are many more places where one could use parallel loops. I only picked the one that takes by far the most time when building the matrix.
