[bug] ISTLBackend_OVLP_GMRES_ILU0 causes a segmentation fault in parallel runs when the number of DOFs is high
Hardware: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (20 CPUs)
OS: CentOS Linux release 7.6.1810 (Core)
Compiler: gcc-6 (GCC) 6.3.0
Dune modules used: codegen, common, functions, geometry, grid, istl, localfunctions, pdelab, testtools, typetree, uggrid
Version: The problem was first observed at the end of April. I am currently using the latest master commits (obtained via dunecontrol git pull).
External software: ParMETIS (for domain decomposition)
We are solving a problem in parallel. With the BCGS linear solver the bug does not occur. If we switch the solver to GMRES and keep everything else the same, we first get "IF_FUNCNAME: receive-timeout for IF 6", and a couple of seconds later a segmentation fault. Whether the bug occurs depends on the number of DOFs.
1. Steps to reproduce
I can give you access to our private dune-richards project and the branch where the bug occurs. In the project, we solve a 3D problem with DUNE-PDELab on an unstructured grid of 4866 * N cubic elements, with domain decomposition across 20 CPUs. The problem is nonlinear, so we use Newton's method. We discretize with 4th-order DG (cubic functions), which means 4^3 = 64 DOFs per grid cell.
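To give a sense of the problem size, here is a rough sketch of the DOF count implied by the numbers above. The helper names are made up for illustration and are not part of dune-richards or PDELab. The per-rank figure assumes an even partition and ignores the overlap, so the real local sizes will differ.

```cpp
#include <cstdint>

// Base number of cubic grid elements from the report (the grid has 4866 * N cells).
constexpr std::int64_t baseCells = 4866;

// Local DOFs per cell: 4 degrees of freedom per direction in 3D, i.e. 4^3.
constexpr std::int64_t dofsPerCell = 4 * 4 * 4;

// Total number of DOFs for a given N (hypothetical helper, for illustration only).
constexpr std::int64_t totalDofs(std::int64_t n)
{
  return baseCells * n * dofsPerCell;
}

// Average number of DOFs per rank.
// Assumption: perfectly even partition, overlap cells not counted.
constexpr std::int64_t averageDofsPerRank(std::int64_t n, std::int64_t ranks)
{
  return totalDofs(n) / ranks;
}
```

With these assumptions, totalDofs(2) is 622848 (the case where GMRES still works) and totalDofs(5) is 1557120 (the case that fails on the first step), i.e. about 77856 DOFs per rank on 20 CPUs before overlap.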
2. Expected behavior
For any N, we expect the Newton iterations to run and then either converge or fail to converge.
We get the expected behavior with the Dune::PDELab::ISTLBackend_OVLP_BCGS_ILU0 linear solver for N from 2 to 5. With Dune::PDELab::ISTLBackend_OVLP_GMRES_ILU0 we get the expected behavior only for N=2.
3. Actual behavior
With higher N, Dune::PDELab::ISTLBackend_OVLP_GMRES_ILU0 leads to a segmentation fault. The higher N is, the earlier in the run the problem occurs. For N=5, the bug appears in the first time step:
TIME STEP [Alexander (claims order 3)] 1 time (from): 0.0000e+00 dt: 5.0000e-01 time (to): 5.0000e-01
STAGE 1 time (to): 2.1793e-01.
Initial defect: 2.4025e+00
The computation then runs for some time and prints the error message:
IF_FUNCNAME: receive-timeout for IF 6
waiting for message (from proc 17, size 25600)
waiting for message (from proc 15, size 104960)
A couple of seconds later, a long error message follows, which you can find in the attachment log.err.
Another couple of seconds later, the run ends with a segmentation fault:
mpirun noticed that process rank 16 with PID 134734 on node gauss2 exited on signal 11 (Segmentation fault).
It seems that the linear solver has an internal timeout counter for some reason. That timeout may be too short for numerically expensive problems.