Add ULFM revoke functionality to the MPIGuard
After a discussion with @smuething and @robert.kloefkorn at the PDESoft, we found a way to split the MR !450 (closed) into parts. We therefore move everything related to the "Blackchannel" into a dedicated library, which uses the profiling interface of MPI to wrap MPI calls and implements some ULFM functions. This lets us use the Blackchannel approach with only small changes in dune, and it also makes the approach usable for other projects.
This MR adds the revoke functionality of the ULFM proposal to the `MPIGuard`. That is, when a process fails it calls `MPIX_Comm_revoke` to interrupt all remote communication calls and to let all further calls fail until the `MPIGuard` is finalized or destructed, which calls `MPIX_Comm_shrink`. The changes are:
- add ULFM detection to the buildsystem
  - first checks for native support of ULFM by the MPI implementation
  - For the OpenMPI extension see http://fault-tolerance.org
  - Caveat: The implementation in MPICH may lead to deadlocks due to http://trac.mpich.org/projects/mpich/ticket/2198
  - if not found, checks whether the blackchannel-ulfm library can be found (https://gitlab.dune-project.org/nils.dreier/blackchannel-ulfm)
- introduce `revoke`, `shrink` and `agree` methods to `CollectiveCommunication<MPI_Comm>`
  - For that we need to change the `MPI_Comm` private member variable to `std::shared_ptr<MPI_Comm>`. Along the way I also introduced resource management by using an appropriate deleter for `MPI_Comm`
- the MPI errorhandler throws exceptions
- adapt `MPIGuard`
- extend mpiguardtest
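
To illustrate the resource management mentioned in the list above, here is a minimal sketch of a `std::shared_ptr` with a custom deleter. It is not the code of this MR: `MockComm`, `mockCommFree`, `freedCount` and `makeManagedComm` are hypothetical stand-ins so that the example compiles without MPI. With real MPI the deleter would presumably call `MPI_Comm_free` and would have to skip predefined communicators such as `MPI_COMM_WORLD`.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for MPI_Comm so the sketch compiles without MPI.
struct MockComm { int id; };

// Counts how often the mock "free" routine ran (for illustration only).
inline int& freedCount() { static int n = 0; return n; }

// Hypothetical stand-in for MPI_Comm_free.
inline void mockCommFree(MockComm* comm) {
  assert(comm != nullptr);
  ++freedCount();
}

// Wrap a communicator handle in a shared_ptr whose deleter first frees the
// communicator and then releases the storage of the handle itself.
inline std::shared_ptr<MockComm> makeManagedComm(int id) {
  return std::shared_ptr<MockComm>(new MockComm{id}, [](MockComm* comm) {
    mockCommFree(comm);
    delete comm;
  });
}
```

All copies of the pointer share one handle, so the communicator is freed exactly once, when the last copy goes away.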
Currently the feature can be disabled by setting the CMake variable `DISABLE_ULFM`.
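
The revoke-then-shrink flow described at the top can be pictured with the following sketch. It is a single-process mock under stated assumptions, not the actual `MPIGuard` implementation: `MockComm` and `RevokeGuardSketch` are hypothetical, `revoke()` and `shrink()` only mimic `MPIX_Comm_revoke` and `MPIX_Comm_shrink` (the real shrink creates a new communicator rather than repairing the old one in place), and detecting failure via `std::uncaught_exceptions()` (C++17) is an assumption of this example.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical mock communicator; no real MPI involved.
struct MockComm {
  bool revoked = false;
  int revokeCalls = 0;
  int shrinkCalls = 0;
  void revoke() { revoked = true; ++revokeCalls; }   // mimics MPIX_Comm_revoke
  void shrink() { revoked = false; ++shrinkCalls; }  // mimics MPIX_Comm_shrink
  void send() const {                                // any communication call
    if (revoked) throw std::runtime_error("communicator revoked");
  }
};

// Hypothetical guard: revokes the communicator when the guarded scope is
// left through an exception, then shrinks it so later calls work again.
class RevokeGuardSketch {
 public:
  explicit RevokeGuardSketch(MockComm& comm)
      : comm_(comm), pending_(std::uncaught_exceptions()) {}
  RevokeGuardSketch(const RevokeGuardSketch&) = delete;
  RevokeGuardSketch& operator=(const RevokeGuardSketch&) = delete;

  // success == false signals a local failure to all other ranks.
  void finalize(bool success = true) {
    if (!active_) return;
    active_ = false;
    if (!success) comm_.revoke();
    if (comm_.revoked) comm_.shrink();
  }

  ~RevokeGuardSketch() {
    // More exceptions in flight than at construction means the scope is
    // being unwound, which this sketch treats as a failure.
    finalize(std::uncaught_exceptions() <= pending_);
    assert(!active_);
  }

 private:
  MockComm& comm_;
  int pending_;
  bool active_ = true;
};
```

In the real setting the revoke makes pending communication on the other ranks fail, so they reach their own guard instead of deadlocking.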