[parallel] Add experimental support for thread parallel assembly
This adds experimental support for thread-parallel assembly.
As discussed, this is an opt-in feature, implemented by an additional template and constructor argument of Assembler for passing an Executor. If none is specified, a SequentialExecutor is used that mimics the old behavior.
For parallel execution this adds the following experimental components (originating in dune-fufem) to Dune::Assembler::Experimental:
- A simple advancing front graph coloring algorithm. This is purely algebraic and independent of any grid structure.
- Utilities for computing the adjacency of the element graph in a grid view, coloring the element graph, and creating a colored version of the element range.
- A simplified implementation of std::barrier from C++20, since gcc-10 does not provide it.
- A ColoredRangeExecutor that provides thread-parallel execution of algorithms using a colored range, based on std::thread.
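To illustrate what a purely algebraic coloring means, here is a minimal sketch of a graph coloring that only sees an adjacency list and no grid. It is a plain greedy coloring in index order, not the advancing front variant of this MR, and the name greedyColoring is hypothetical rather than part of dune-assembler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not the dune-assembler implementation.
// Plain greedy coloring: visit the vertices in index order and give each one the
// smallest color that is not used by an already colored neighbor.
std::vector<std::size_t> greedyColoring(const std::vector<std::vector<std::size_t>>& adjacency)
{
  const std::size_t n = adjacency.size();
  const std::size_t uncolored = n; // a valid color is always smaller than n
  std::vector<std::size_t> color(n, uncolored);
  std::vector<bool> used;
  for (std::size_t v = 0; v < n; ++v)
  {
    // A vertex with d neighbors needs at most color d, so d+1 slots suffice.
    used.assign(adjacency[v].size() + 1, false);
    for (std::size_t w : adjacency[v])
    {
      assert(w < n);
      if (color[w] != uncolored && color[w] < used.size())
        used[color[w]] = true;
    }
    // At most d of the d+1 slots are marked, so a free one always exists.
    std::size_t c = 0;
    while (used[c])
      ++c;
    color[v] = c;
  }
  return color;
}
```

In the assembly setting, the vertices would be grid elements and two elements would be adjacent if they share degrees of freedom, so elements of the same color can be assembled concurrently without write conflicts.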
Furthermore this demonstrates the usage in the poisson-pq2 and poisson-pq2-eigen examples.
Since this leads to a change in indentation in the Assembler methods, the diff looks quite large. The actual change is just a few lines that forward the method bodies and the element loop to the executor. At least locally this can be seen using git diff -w.
There is, however, one significant change: In order to guarantee thread-locality of data, the executor captures the local assembler by value and thus makes a copy. In principle we agreed that this behavior is intended. One may argue that the copy should be avoided in the sequential case, but this would make the executor usage more invasive.
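The hook-in and the copy semantics described above can be sketched as follows. This is a simplified illustration under assumptions: SketchAssembler, SketchSequentialExecutor, CountingLocalAssembler and the forEach(range, f) signature are hypothetical stand-ins, not the actual dune-assembler classes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical default executor: a plain sequential loop over the given range.
struct SketchSequentialExecutor
{
  template <class Range, class F>
  void forEach(const Range& range, F f) const
  {
    for (const auto& e : range)
      f(e);
  }
};

// Hypothetical local assembler with state, used to make the copy visible.
struct CountingLocalAssembler
{
  int calls = 0;
  double operator()(std::size_t) { ++calls; return 1.0; }
};

// Hypothetical global assembler: the executor is an optional extra
// template and constructor argument, defaulting to the sequential one.
template <class RowBasis, class ColBasis, class Executor = SketchSequentialExecutor>
class SketchAssembler
{
  const RowBasis& rowBasis_;
  const ColBasis& colBasis_;
  Executor executor_;

public:
  SketchAssembler(const RowBasis& rowBasis, const ColBasis& colBasis)
    : SketchAssembler(rowBasis, colBasis, Executor{})
  {}

  SketchAssembler(const RowBasis& rowBasis, const ColBasis& colBasis, Executor executor)
    : rowBasis_(rowBasis), colBasis_(colBasis), executor_(std::move(executor))
  {}

  const RowBasis& rowBasis() const { return rowBasis_; }
  const ColBasis& colBasis() const { return colBasis_; }

  template <class Elements, class LocalAssembler, class Vector>
  void assembleVector(const Elements& elements, LocalAssembler& la, Vector& vector) const
  {
    // The local assembler is captured by value, so the loop body works on a
    // copy and the caller's object is left untouched.
    executor_.forEach(elements, [la, &vector](std::size_t e) mutable {
      vector[e] += la(e);
    });
  }
};
```

In this sketch the caller's local assembler keeps calls == 0 after assembly, even with the sequential executor, which is exactly the behavioral change discussed above.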
  private:
    const RowBasis& rowBasis_;
    const ColBasis& colBasis_;
+   Executor executor_;

  public:

-   Assembler (const RowBasis& rowBasis, const ColBasis& colBasis)
+   Assembler (const RowBasis& rowBasis, const ColBasis& colBasis, Executor executor)
In std C++ algorithms, the execution policy typically comes as the first argument. Maybe we can discuss whether to follow this pattern. We would probably have to write more constructors delegating to the full implementation constructor, but this is done anyway.
Edited by Simon Praetorius
        }
        if (elementContributionsComputed)
          matrixPattern.scatter(get<0>(localViews),get<1>(localViews), localPattern[0][0]);
-     }
+       });
+     });
    }

    template <Concepts::LocalVectorAssembler<typename RowBasis::LocalView> LocalAssembler, class Vector>
    void assembleVector (LocalAssembler& la, Vector& vector)
added 7 commits
- 9e160568 - [parallel] Add graph coloring algorithm
- 0a61d356 - [parallel] Add graph utilities
- d41bd30e - [parallel] Add utilities for computing a colored element range
- 7ab76746 - [parallel] Add executor for parallel algorithms on colored ranges
- b7d4bd62 - [parallel] Add SequentialExecutor
- c10311dd - Allow to use an executor in global Assembler
- 5bcafd6e - [examples] Use parallel executor in ISTL and Eigen example
dune/assembler/sequentialexecutor.hh 0 → 100644
#include <dune/common/iteratorrange.hh>

#include <dune/assembler/parallel/barrier.hh>

namespace Dune::Assembler {

/**
 * \brief A sequential executor
 *
 * This is the default executor used in the global Assembler class,
 * that lets the Assembler use a sequential element loop.
 *
 * Note that besides being default constructible, the precise
 * interface of this class is an implementation detail and may
I doubt that the current pattern is enough to represent anything significantly more general. You can't even map a task-based approach like TBB to this. For now the main goal of the interface design was to be minimally invasive in the global assembler, so as not to complicate a later design of a proper interface.
I think it would be good to check how other frameworks handle the abstractions for an execution context. @simon.praetorius already mentioned std, but there are also others we might want to have a look at, e.g. sycl, hpx, kokkos, ginkgo, ...
I have found an older set of slides which include a couple of examples in different languages: https://www.khronos.org/assets/uploads/developers/library/2016-supercomputing/SC16-compareSYCL-Michael-Wong_Nov16.pdf
Regarding the order of arguments my impression is that there is no common pattern (yet). hpx follows closely the design of std, but besides this, things go wild.
I think none of them covers what we need here, because we also need to encapsulate the coloring information. Hence the executor here is neither a simple tag class nor some abstract execution mechanism for asynchronous/parallel code.
If we wanted something like the latter, we would need to split the algorithm and the data (e.g. the coloring). This would also require designing an interface for such colored/partitioned data. Both would not be minimally invasive as desired here, and both are orthogonal to this MR, because it deliberately does not propose an official interface for this, except for passing a single additional opaque value with an implementation-defined interface.
Two notes on the current implementation:
- The forEach method of the executor gets a range argument, which is traversed by the SequentialExecutor, while the ColoredRangeExecutor ignores the argument and uses its internal colored range. That's probably not the nicest solution, but since the whole interface is internal and experimental anyway, it's not a big issue for now.
- The ColoredRangeExecutor currently does not use chunks, and using it with a single thread is slower than the SequentialExecutor, at least for UGGrid. I have some local code computing a chunked coloring, but this needs some cleanup and testing.
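The asymmetry in the first note can be sketched as follows. This is a simplified illustration, not the ColoredRangeExecutor of this MR: the class names and the forEach signature are assumptions, integer indices stand in for grid elements, and threads are spawned and joined per color instead of being synchronized with a barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sequential executor: traverses the range it is given.
struct SketchSequentialExecutor
{
  template <class Range, class F>
  void forEach(const Range& range, F f) const
  {
    for (const auto& e : range)
      f(e);
  }
};

// Hypothetical colored executor: stores the colored range itself.
class SketchColoredExecutor
{
  std::vector<std::vector<int>> colors_; // colors_[c] = indices with color c
  std::size_t threadCount_;

public:
  SketchColoredExecutor(std::vector<std::vector<int>> colors, std::size_t threadCount)
    : colors_(std::move(colors)), threadCount_(threadCount)
  {
    assert(threadCount_ > 0);
  }

  // The range argument is ignored; the internal colored range is used instead.
  template <class Range, class F>
  void forEach(const Range&, F f) const
  {
    for (const auto& sameColor : colors_)
    {
      std::vector<std::thread> threads;
      for (std::size_t t = 0; t < threadCount_; ++t)
        // Each thread gets its own copy of f and a strided (non-chunked)
        // share of the current color.
        threads.emplace_back([&sameColor, f, t, stride = threadCount_]() mutable {
          for (std::size_t i = t; i < sameColor.size(); i += stride)
            f(sameColor[i]);
        });
      // All threads finish one color before the next one starts.
      for (auto& th : threads)
        th.join();
    }
  }
};
```

Both classes accept the same call, which is what keeps the change in the global assembler small, at the price of an argument that one of them silently ignores.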
added 1 commit
- 94c348fe - [parallel][doc] Improve internal document of barrier
added 10 commits
- f1978242...6181d6f3 - 2 commits from branch main
- 8e1098c7 - [parallel] Add graph coloring algorithm
- 1c9dc171 - [parallel] Add graph utilities
- 6b7a37eb - [parallel] Add utilities for computing a colored element range
- 3ca72c1e - [parallel] Add executor for parallel algorithms on colored ranges
- 414e8403 - [parallel] Add SequentialExecutor
- 6b958ce2 - Allow to use an executor in global Assembler
- 3702b89a - [examples] Use parallel executor in ISTL and Eigen example
- e81a4694 - [parallel][doc] Improve internal document of barrier
@oliver.sander, @christi Can you comment on this?
Maybe I should make the question more precise. So far the global assembler has the following interface:
auto a = Assembler<RowBasis,ColBasis>(rowBasis, colBasis);
This MR additionally introduces support for
auto a = Assembler<RowBasis,ColBasis,Executor>(rowBasis, colBasis, executor);
and a default SequentialExecutor that reproduces the old interface and behavior. This is the only interface addition proposed here. It provides the discussed minimally invasive hook-in mechanism needed to inject thread-parallel element loops and thus thread-parallel assembly. The interface of the executor remains an implementation detail and is not proposed so far. The only additional change (but we discussed and agreed on this earlier) is that the executor is allowed to copy the local assembler. The question is whether anyone objects to this change.
Additionally this MR provides an experimental prototype of a parallel executor that can be used as follows:
auto coloredRange = Experimental::coloredElementRange(gridView);
auto executor = Experimental::ColoredRangeExecutor(coloredRange, threadCount);
auto a = Assembler(rowBasis, colBasis, executor);
Notice that all interfaces involved with the latter are explicitly marked experimental and not proposed as official interfaces here. But they are required to let dune-fufem depend on dune-assembler unless I want to copy all the global assembler code just to change one line for the element loop.
Edited by Carsten Gräser
As long as the executor is optional, I'm fine with the proposed change. Looking quickly at the changes, it seems you didn't specialize for seq and threaded, but decided on an "adhoc" interface for the executor. It would be good to
a) document this interface
b) mark it as experimental
c) revise/extend this interface after having gained some more experience (e.g. trying to implement a different executor)
it seems you didn't specialize for seq and threaded, but decided on an "adhoc" interface for the executor
That's on purpose. Otherwise I could have copied the global assembler and would not need to discuss this here.
a) document this interface b) mark it as experimental c) revise/extend this interface after having gained some more experience
Regarding a) and b): It is already documented and marked as experimental.
Regarding c): With the actual code I already have some experience, since it's been in use in fufem for two years. The actual interface to integrate it in dune-assembler is currently designed to be minimally invasive. It works well for the thread-based executor, but will not work for a task-based approach. Collecting experience with this requires that someone tries to implement another executor, e.g. based on TBB.
Thanks for a), b), this slipped my notice.
Regarding c), this is exactly the point, this interface must be experimental, until we really have gained experience with something more difficult. Then we need to discuss in more detail. I just wanted to make clear that I would not like to see here a "defacto" standard...
Not that I had the impression that you wanted to introduce this, but users might be tempted to assume the interface as such and thus we should be clear in the communication.
I just wanted to make clear that I would not like to see here a "defacto" standard...
That's why the word experimental shows up 21 times in the diff.
Not that I had the impression that you wanted to introduce this, but users might be tempted to assume the interface as such and thus we should be clear in the communication.
For users of dune-fufem this will be true, because the experimental implementation in dune-assembler will replace the standard one in dune-fufem. As dune-fufem maintainer I accept that I'll have to adjust to potential changes in dune-assembler quickly. But knowing the quotient
in core and staging to date I'm pretty relaxed about the precise meaning of "quickly".
mentioned in merge request fufem/dune-fufem!214