CGSolver should stop immediately if the defect is NaN

I guess we can probably agree on that the solver should stop immediately. My question is: in which way should it stop?

Exception? (Which one? What happens to the InverseOperatorResult &res parameter?)
Simply returning after setting res.converged=false (and the other members of InverseOperatorResult to some other meaningful value, perhaps NaN) as soon as NaN is detected?

I consider NaNs in linear system a bug in user code. This should be caught there or somewhere else using floating point exceptions SIGFPE. Introducing a performance penalty (which I would think such a check imposes) is probably a bad idea.

I agree with @markus.blatt. Do you really need the solver to catch NaNs or are you just scratching your sense of beauty?

I'm not sure we are talking about the same thing here.

Here is a sketch of what I would change:

--- a/dune/istl/solvers.hh
+++ b/dune/istl/solvers.hh
@@ -472,9 +472,10 @@ namespace Dune {
           this->printOutput(std::cout,real_type(i),defnew,def);
 
         def = defnew;               // update norm
-        if (def<def0*_reduction || def<1E-30)    // convergence check
+        if (def<def0*_reduction || ! (def > 1E-30))    // convergence check
         {
-          res.converged  = true;
+          using std::isnan;
+          res.converged  = !isnan(def);
           break;
         }

The convergence check happens once per iteration. There is a bit of reshuffling in the logic needed, but I doubt that will lead to any measurable performance impact. In any case, it pales against the needless def0*_reduction, which is happening in every iteration anyway. The actual check for NaN only happens once, not in every iteration. And even if there would be an additional floating point operation per iteration (which would allow the code to be much more readable), I honestly fail to see a scenario where this is something to worry about.

If you still think that is an issue, I would like to know more about the scenario you have in mind.

I don't worry about efficiency, but rather about the additional magic you want to stuff in their. Looking at your code example, I don't really like the def<1E-30 in there either. (I know it wasn't you who wrote it).

@markus.blatt was worried about performance.

I agree with @oliver.sander that the absolute convergence check def<1E-30 is magic. Not because it is there at all, after all, it is needed. But because the limit cannot be adjusted, and because we don't tell anyone about it in the documentation.

A check for NaN however is not magic. You've crossed the limit of what the floating point representation can do for you, your result is gonna be bogus. If you keep iterating, your result is still gonna be bogus. There is no magic number (like 1e-30) here.

My proposal is to stop as soon as we know we cannot produce a meaningful result.

I don't like @markus.blatt's suggestion about using SIGFPE, because I don't think there is a clean way to recover from that in C++. You could try doing longjmp() out of the signal handler, but you'd have to do so out of the solver, which means skipped destructors, probably leaked memory and generally undefined behavior.

My suggestion was that the user should activate and catch SIGFPE in his/her code. I have done that a couple of times.

Sorry, I don't get it. If your defect becomes NaN, there is no point at all in continuing to solve, or is there? We won't ever converge anyway, because the NaN is not going to go away. And sorry, but playing around with SIGFPE is a nightmare, in particular in multi-threaded code (and I don't want to explain that to a user)...

And if you don't want the weird comparison: Just use isnan(def0). That's just a simple bitwise comparison and dirt cheap, at least compared to everything that happens within each iteration

Which probably won't work with -ffastmath... Even if it does work it won't help finding the cause of the problem, which is somewhere else (in Joe's example and in all NaN that I experienced so far). Finding causes for NaN is the real nightmare, I guess.

If you insist, feel free to add it. I would use a specialized exception.

And, by the way, a NaN in the input data is not the only way to get a NaN during the solve. Inf is enough, and I don't think you can check for that with a SIGFPE.

#include <iostream>
#include <limits>

int main()
{
    std::cout << std::numeric_limits<double>::infinity() * 0.0 << std::endl;
}

The will get you a NaN.

All right, I will prepare a patch.

Reassigned to @joe

mentioned in merge request !36 (merged)

There is now merge request !36 (merged) which introduces a specialized exception as @markus.blatt suggested. The exception is called SolverAbort, is derived from ISTLError, and meant for cases when the solver knows continuing does not make any sense.

This seemed appropriate for BiCGStab and GMRes too, so I prepared patches for them as well (included in !36 (merged)). They were already detecting "breakdown", but were throwing ISTLError previously.

I found another example where NaN appears in the defect, this time without anything wrong in the input data that is easy to detect: when the system does not have a solution. E.g. when trying to solve A*x=b with CG and Richardson, where

    / 1  1 \        / 1 \               / 0 \
A = |      |,   b = |   |,  initial x = |   |
    \ 1  1 /        \ 2 /               \ 0 /

The defect will become large, then small again, then grow again, and after 46 iterations will be -NaN and stay that way.

I think I also found an example where the time saved matters for production: when you use your linear solver inside some nonlinear solver, e.g. a Newton with linesearch. I haven't studied those in detail, but is it not the case that the nonlinear solver may attempt to linearize the system at a position where it is not solvable, and if the linear solver does not converge, retry at a different position? If the linear solver fails as early as reasonably possible, that will save time.

Status changed to closed by commit 538d99fa

Status changed to closed by merge request !36 (merged)

mentioned in commit 538d99fa

CGSolver should stop immediately if the defect is NaN

Designs

Child items ...

Activity