
Add ULFM revoke functionality to the MPIGuard

Nils-Arne Dreier requested to merge feature/ulfm-mpiguard into master

After a discussion with @smuething and @robert.kloefkorn at PDESoft, we found a way to split MR !450 (closed) into parts. We therefore move everything related to the "Blackchannel" into a dedicated library, which uses the MPI profiling interface to wrap MPI calls and implements some ULFM functions. This lets us use the Blackchannel approach with only small changes in dune, and it also makes the approach usable for other projects.

This MR adds the revoke functionality of the ULFM proposal to the MPIGuard. That is, when a process fails, it calls MPIX_Comm_revoke to interrupt all remote communication calls and lets all further calls fail until the MPIGuard is finalized or destructed, which calls MPIX_Comm_shrink. The changes are:

  • add ULFM detection to the buildsystem
  • introduce revoke, shrink and agree methods in CollectiveCommunication<MPI_Comm>
    • For that we need to change the private MPI_Comm member variable to std::shared_ptr<MPI_Comm>. Along the way I also introduced resource management by using an appropriate deleter for MPI_Comm
  • the MPI errorhandler throws exceptions
  • adapt MPIGuard
  • extend mpiguardtest
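
To illustrate how the shared communicator handle and the guard fit together, here is a minimal sketch. It is not the code of this MR: `FakeComm`, `fake_comm_free`, `make_comm`, `Communication` and `Guard` are hypothetical stand-ins so that the example builds without an MPI installation. In real code the handle would be an `MPI_Comm` released with `MPI_Comm_free`, and `revoke`/`shrink` would forward to `MPIX_Comm_revoke` and `MPIX_Comm_shrink`. It assumes C++17 for `std::uncaught_exceptions`.

```cpp
#include <cassert>
#include <exception>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for MPI_Comm so the sketch builds without MPI.
struct FakeComm {
  bool revoked = false;
  int size = 4;
};

// Counts released handles so the ownership behaviour can be observed.
static int freed_count = 0;

// Hypothetical stand-in for MPI_Comm_free, used as the shared_ptr deleter.
void fake_comm_free(FakeComm* c) {
  ++freed_count;
  delete c;
}

// The last owner of the handle releases the communicator via the deleter.
std::shared_ptr<FakeComm> make_comm() {
  return std::shared_ptr<FakeComm>(new FakeComm, fake_comm_free);
}

// Hypothetical analogue of CollectiveCommunication<MPI_Comm>.
class Communication {
public:
  explicit Communication(std::shared_ptr<FakeComm> c) : comm_(std::move(c)) {}

  // Stand-in for MPIX_Comm_revoke: marks the communicator as unusable.
  void revoke() { comm_->revoked = true; }

  // Stand-in for MPIX_Comm_shrink: switch to a fresh, smaller communicator.
  void shrink() {
    auto smaller = make_comm();
    smaller->size = comm_->size - 1;  // pretend exactly one rank failed
    comm_ = std::move(smaller);       // old handle is freed if we were the last owner
  }

  bool revoked() const { return comm_->revoked; }
  int size() const { return comm_->size; }

private:
  std::shared_ptr<FakeComm> comm_;
};

// Hypothetical analogue of MPIGuard: revoke on failure, shrink on destruction.
class Guard {
public:
  explicit Guard(Communication& cc)
      : cc_(cc), exceptions_(std::uncaught_exceptions()) {}

  ~Guard() {
    // An exception is unwinding through the guarded scope: this rank failed.
    if (std::uncaught_exceptions() > exceptions_) cc_.revoke();
    // A revoked communicator is replaced before anyone uses it again.
    if (cc_.revoked()) cc_.shrink();
  }

private:
  Communication& cc_;
  int exceptions_;
};
```

A guarded section would then look like `{ Guard g(cc); /* work that may throw */ }`. If the work throws, the destructor revokes and shrinks the communicator during stack unwinding, and the old handle is released by the deleter once no copy refers to it.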

Currently the ULFM support can be toggled via the CMake variable DISABLE_ULFM.

Edited by Nils-Arne Dreier
