Debian-11-gcc-9-17 docker is non-deterministic during Python tests
Edit: TL;DR: The Debian 11 Docker image used for the Python tests fails randomly, with varying assertion errors. The results of numpy.linalg.solve were wrong. No problems were found in the DUNE-related methods. For now we use the Ubuntu 20.04 Docker image, where the observed problems do not occur. Please try again once Debian 11 (stable) is released and the Docker images are updated.
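For anyone retrying on an updated image, a quick sanity check of numpy.linalg.solve independent of DUNE might look like the sketch below. It is only an illustration: the function name check_solve_repeatable is hypothetical, and the random, diagonally dominant matrix is a stand-in for the system assembled in poisson.py, not the actual Poisson matrix. It solves the same system repeatedly and reports the largest residual and the largest deviation from the first solution.

```python
import numpy as np


def check_solve_repeatable(n=50, repeats=20, seed=0):
    """Solve the same linear system several times and report how far
    the results drift (hypothetical helper, not part of DUNE)."""
    rng = np.random.default_rng(seed)
    # Stand-in system: random entries plus a strong diagonal, so the
    # matrix is well conditioned and the solve should be accurate.
    A = rng.random((n, n)) + n * np.eye(n)
    b = rng.random(n)

    reference = np.linalg.solve(A, b)
    max_residual = 0.0
    max_deviation = 0.0
    for _ in range(repeats):
        x = np.linalg.solve(A, b)
        # Residual ||Ax - b|| shows whether the result is a solution.
        max_residual = max(max_residual, float(np.linalg.norm(A @ x - b)))
        # Deviation from the first solve shows non-deterministic results.
        max_deviation = max(max_deviation, float(np.max(np.abs(x - reference))))
    return max_residual, max_deviation


if __name__ == "__main__":
    residual, deviation = check_solve_repeatable()
    print(f"max residual: {residual:.3e}, max deviation: {deviation:.3e}")
```

On a healthy setup both numbers should be close to machine precision. Clearly larger values inside the container would point at the NumPy/LAPACK stack of the image rather than at the DUNE bindings.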
Original Issue:
The test for the Poisson problem in Python (https://gitlab.dune-project.org/staging/dune-functions/-/blob/master/dune/python/test/poisson.py) shows non-deterministic behavior and sometimes fails in the CI runner. On my local machine I was never able to reproduce the failure. I double-checked that all DUNE modules are up to date. All investigation on my side led to the conclusion that there must be some undefined behavior in the code. I compiled everything with sanitizers, but no problems were reported. A run with valgrind pointed out one suspicious event:
==28997== Invalid read of size 4
==28997== at 0x4FA3CA: ??? (in /usr/bin/python3.9)
==28997== by 0x51C50D: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==28997== by 0x515376: ??? (in /usr/bin/python3.9)
==28997== by 0x52D301: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==28997== by 0x516542: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==28997== by 0x52D162: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==28997== by 0x51B3AB: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==28997== by 0x52D162: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==28997== by 0xF639755: call (cast.h:2032)
==28997== by 0xF639755: pybind11::object pybind11::detail::object_api<pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr> >::operator()<(pybind11::return_value_policy)1, pybind11::module_&>(pybind11::module_&) const (cast.h:2187)
==28997== by 0xF7332CF: void Dune::Python::registerHierarchicalGrid<Dune::YaspGrid<2, Dune::EquidistantOffsetCoordinates<double, 2> >>(pybind11::module_, pybind11::class_<Dune::YaspGrid<2, Dune::EquidistantOffsetCoordinates<double, 2> >>) (hierarchical.hh:476)
==28997== by 0xF5BDD80: pybind11_init_hierarchicalgrid_966e2a5c8356c5b278ccd3acad180f0a(pybind11::module_&) (hierarchicalgrid_966e2a5c8356c5b278ccd3acad180f0a.cc:23)
==28997== by 0xF5BE0F1: PyInit_hierarchicalgrid_966e2a5c8356c5b278ccd3acad180f0a (hierarchicalgrid_966e2a5c8356c5b278ccd3acad180f0a.cc:16)
==28997== Address 0x6fff020 is 496 bytes inside a block of size 559 free'd
==28997== at 0x4847C73: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==28997== by 0x5399C8: ??? (in /usr/bin/python3.9)
==28997== by 0x5A2608: ??? (in /usr/bin/python3.9)
==28997== by 0x52F84C: ??? (in /usr/bin/python3.9)
==28997== by 0x51BED5: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==28997== by 0x515376: ??? (in /usr/bin/python3.9)
==28997== by 0x52D301: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==28997== by 0x54015E: ??? (in /usr/bin/python3.9)
==28997== by 0x540844: PyObject_Call (in /usr/bin/python3.9)
==28997== by 0x5181B5: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==28997== by 0x515376: ??? (in /usr/bin/python3.9)
==28997== by 0x52D301: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==28997== Block was alloc'd at
==28997== at 0x4842839: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==28997== by 0x4F5AA2: ??? (in /usr/bin/python3.9)
==28997== by 0x539725: ??? (in /usr/bin/python3.9)
==28997== by 0x5A2608: ??? (in /usr/bin/python3.9)
==28997== by 0x52F84C: ??? (in /usr/bin/python3.9)
==28997== by 0x51BED5: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==28997== by 0x515376: ??? (in /usr/bin/python3.9)
==28997== by 0x52D301: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==28997== by 0x54015E: ??? (in /usr/bin/python3.9)
==28997== by 0x540844: PyObject_Call (in /usr/bin/python3.9)
==28997== by 0x5181B5: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==28997== by 0x515376: ??? (in /usr/bin/python3.9)
At this point my knowledge of the Python bindings reached its end. What is the best strategy to debug this issue?