This is my proposal for the meaning of viewThreadSafe for the grid interface. I want to discuss it at the 2015 developer meeting.
I'm still working on it and may do some shuffling around of sections and adding of examples, but the core authoritative specification is done and I want to request comments from other developers.
Note that the proposal does not mention caching (at best some non-binding guidelines are given). This implicitly means that any caching must be transparent to the interface user.
I know that explicit caching ("you have to request the index set once in a sequential section before you can request it concurrently") was requested at a previous meeting. The problem with that is that then you essentially have to specify what happens for every method individually. When I tried to write a specification that included provisions for caching everywhere, I essentially ended up with what is now the appendix -- except that it was even longer and more complicated. In every object associated with the grid I had to consider subobjects for caches. It was made even more complicated because initializing one cache may modify other caches, so interactions between caches had to be considered.
In the current proposal, the authoritative specification is one page -- everything else is just clarification and checking that it is possible to implement that specification.
Here is the newest proposal (revision c354ea03955782349c20f313c9f71d960179c4f7). I've looked at various geometry implementations to estimate whether they are thread-safe according to the specification, and if not, to suggest how to fix them. The implementations I've looked at are:
from dune-geometry: AxisAlignedCubeGeometry, AffineGeometry, MultiLinearGeometry and CachedMultiLinearGeometry
from dune-grid: AlbertaGrid (without DUNE_ALBERTA_CACHE_COORDINATES=0), GeometryGrid, IdentityGrid, OneDGrid, UGGrid and YaspGrid
the geometries from dune-alugrid
I've also done the same for the ALUMemoryProvider from dune-alugrid.
This is after the developer meeting, where a few points were raised.
Robert K. wants tests that exercise things in a thread-parallel fashion, to be run under helgrind. These tests should do things that are supposed to be thread-safe; there is no need to try extra hard to provoke any hidden races, since helgrind should be able to find missing synchronization on its own.
Two functions were added to the list of functions that may communicate: globalRefine() and globalIDSet().
Steffen and Oliver found a case, detailed below, that cannot be neglected and where performance might suffer intolerably:
Consider an affine geometry that can compute a Jacobian inverse. There are several use cases:
1. The inverse is requested often for the same entity (e.g. at many quadrature points).
2. The inverse is requested only once per entity (e.g. it is only evaluated at the center).
3. The inverse is never requested, and it is not even guaranteed that computing the inverse succeeds.
Computing the inverse only when it is needed for the first time and storing the result accounts for all three cases: the inverse is computed only once (1), there is little overhead (2), and if the inverse is never computed (3), there is nothing that can fail.
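A minimal standalone sketch of this lazy-caching pattern; the class and member names are illustrative only, not the actual Dune geometry interface:

```cpp
#include <array>

class AffineGeometryExample {
  double a_, b_, c_, d_;                  // entries of a 2x2 Jacobian
  mutable bool inverseComputed_ = false;  // cache guard: a plain bool, NOT thread-safe
  mutable std::array<double, 4> inverse_{};
public:
  AffineGeometryExample(double a, double b, double c, double d)
    : a_(a), b_(b), c_(c), d_(d) {}

  // Lazy evaluation covers all three cases: the inverse is computed only on
  // the first request (1), later requests cost a single branch (2), and a
  // non-invertible Jacobian cannot cause a failure as long as the inverse is
  // never requested (3).
  const std::array<double, 4>& jacobianInverse() const {
    if (!inverseComputed_) {
      const double det = a_ * d_ - b_ * c_;
      inverse_ = { d_ / det, -b_ / det, -c_ / det, a_ / det };
      inverse_computed_marker: inverseComputed_ = true;
    }
    return inverse_;
  }
};
```

Two threads calling jacobianInverse() on the same entity may both see inverseComputed_ == false, or one may read a half-written inverse_; that race is exactly what the const method hides from the caller.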
Requiring thread-safety makes it necessary to turn the cache guard into some synchronization construct (e.g. a once_flag). This works in cases (1) and (3). It also works in case (2), but may incur a huge performance penalty, since the once_flag must be accessed in a synchronized fashion. On some architectures this may require communication in any case, and may overload the bus.
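The same cache with the guard replaced by a std::once_flag, again with illustrative names; this is the variant where every call, including the single-request case (2), pays for the synchronized access:

```cpp
#include <array>
#include <mutex>

class ThreadSafeAffineGeometryExample {
  double a_, b_, c_, d_;                  // entries of a 2x2 Jacobian
  mutable std::once_flag inverseOnce_;    // synchronized cache guard
  mutable std::array<double, 4> inverse_{};
public:
  ThreadSafeAffineGeometryExample(double a, double b, double c, double d)
    : a_(a), b_(b), c_(c), d_(d) {}

  // Safe to call concurrently from many threads: std::call_once guarantees
  // the inverse is computed exactly once, but every call must inspect the
  // once_flag in a synchronized fashion.
  const std::array<double, 4>& jacobianInverse() const {
    std::call_once(inverseOnce_, [this] {
      const double det = a_ * d_ - b_ * c_;
      inverse_ = { d_ / det, -b_ / det, -c_ / det, a_ / det };
    });
    return inverse_;
  }
};
```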
I am currently investigating how severe the penalty actually is. If it is too large, we have still some options:
1. Ignore the const-means-threadsafe paradigm in this case. CON: will surprise users.
2. Make the interface of functions that might initialize a cache in at least some implementations non-const. CON: interface change; similar changes would be needed on the iterators for "for(auto && e : elements(gv))" to keep working nicely, but there are bugs in gcc that prevent that syntax from working (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63506).
3. Grids that have this problem cannot be viewthreadsafe. CON: there are ways such grids can be used thread-safely, but specifying how becomes complicated.
4. geo.jacobianInverse() returns a function that can be evaluated with a local coordinate. CON: doesn't (yet) cover the case of requesting an entity from an iterator, or requesting a geometry from an entity.
Thus I am currently leaning toward option 3 from the previous post, with the longer-term plan to migrate to option 4. I'll have to talk to Steffen about option 4.
The CachedMultiLinearGeometry differs from the MultiLinearGeometry in slightly more than "just a bool flag". Upon initialization, it checks whether the geometry is affine and, if so, it stores the jacobianTransposed for later use. Apart from the by-product of computing the jacobianTransposed (which needs to be computed anyway), this check comes with some overhead (extra computations and branches).
So, without having fully looked into your test results, 18% seems reasonable to me. Whether this overhead is justified, or whether it should be enforced by a grid manager, is a different question that has no impact on threading.
@Christian: After doing the numbers from the actual results (as opposed to "from memory"), the overhead is 67% for full thread-safety compared to no caching. Which does not change the conclusion.
@Christian: Of course. My comment on the comparison with the CachedMultiLinearGeometry was only intended to prevent jumping to the conclusion that, since simple caching ("a bool") costs 20%, 67% is acceptable. That is not so.
@Martin: Good point. We cannot avoid the affine() check, no matter what kind of caching we do, as long as any of the potentially cached values are used. We can probably avoid it in the case of option 4 (jacobianInverseTransposed() and integrationElement() return functions instead of values) when the program needs neither integrationElement()/volume() nor jacobianInverseTransposed(). However, that needs careful consideration of how to design the interface, e.g.:
do we give the program the option to pass the integrationElement() function into the jacobianInverseTransposed() method to avoid the need to recompute it there? Then that method could extract the affine flag from the passed-in integrationElement function.
do we do it the other way round, i.e. if both the jacobianInverseTransposed and the integrationElement are needed, will jacobianInverseTransposed() yield both?
In any case, we probably do not want to force the program to extract the affine flag and pass it into either integrationElement() or jacobianInverseTransposed(), because that seems like rather low-level information. I would consider it OK in the case of the pair integrationElement/jacobianInverseTransposed, because the math already tells us that the integrationElement is needed in/is a byproduct of the computation of the inverse, so that isn't really an implementation detail.
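A very rough sketch of the first variant for a 2D affine map; every name here is hypothetical, not part of any existing interface. The determinant computed for the integration element carries the affine information, so passing the integrationElement() result into jacobianInverseTransposed() avoids recomputing it:

```cpp
#include <array>
#include <cmath>

struct IntegrationElementFn {
  double det;  // cached determinant; this is where the affine flag would live
  double operator()() const { return std::abs(det); }  // affine case: independent of x
};

struct JacobianInverseFn {
  double a, b, c, d, det;
  std::array<double, 4> operator()() const {
    return { d / det, -b / det, -c / det, a / det };
  }
};

struct GeometrySketch {
  double a, b, c, d;  // entries of the affine Jacobian
  IntegrationElementFn integrationElement() const {
    return { a * d - b * c };
  }
  // Reuses the determinant already computed for the integration element
  // instead of recomputing it.
  JacobianInverseFn jacobianInverseTransposed(const IntegrationElementFn& ie) const {
    return { a, b, c, d, ie.det };
  }
};
```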
42% still seems like a lot. As you said, it would be possible to avoid the issue by using thread-local storage. The downside is that it requires interface changes and, we have to admit, a rather ugly new interface.
What we could do, though I'm unsure whether it would be worth the effort, is to turn the actual Jacobian implementation into something that does both the caching and the evaluation. See the following very rough sketch:
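Something along these lines, perhaps (all names illustrative): the geometry would return one of these objects by value, and the returned object does both the caching and the evaluation. Since every thread obtains its own copy, the cache needs no synchronization.

```cpp
#include <array>
#include <optional>

class CachingJacobianInverse {
  double a_, b_, c_, d_;                         // entries of a 2x2 Jacobian
  std::optional<std::array<double, 4>> cache_;   // per-copy, hence per-thread
public:
  CachingJacobianInverse(double a, double b, double c, double d)
    : a_(a), b_(b), c_(c), d_(d) {}

  // Non-const: the thread-local copy is mutated, while the geometry that
  // handed it out stays untouched.
  const std::array<double, 4>& operator()() {
    if (!cache_) {
      const double det = a_ * d_ - b_ * c_;
      cache_ = std::array<double, 4>{ d_ / det, -b_ / det, -c_ / det, a_ / det };
    }
    return *cache_;
  }
};
```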
I don't think we should try to find an efficient interface for the caching geometric quantities in a multithreaded environment. It is bound to become ugly.
But Jö's suggestion looks interesting to me: If a method jacobianInverseTransposed() returns a function, this new object can internally do the caching. The geometry is not changed. As this function is only known to the thread asking for it, threading issues become a user problem. If you want to compute this function 8 times, I'm ok with it. It might still be a large gain. If, however, you only use the entity in one thread, there won't be any race conditions anyway.
As to the side effect of the integration element: this is not a new interface issue. Unless the integration element can be cached due to an affine mapping, we already compute it twice for every integration point that requires both the integration element and the Jacobian inverse. Therefore, I would keep this discussion separate.
I think most race conditions between caching and threading boil down to lazy evaluation. Unfortunately, our rich interfaces provide lots of information that we want evaluated neither overeagerly nor twice (e.g., jacobianInverseTransposed, integrationOuterNormal, neighbor, indexSet). I think we should overhaul these interfaces such that there is no performance impact if the evaluation is done eagerly at the first point where
the information can be computed, and
we know the information is going to be used.
Then, there is no reason to use any non-threadsafe caching.
On the downside, the user must handle more objects. To achieve maximum performance, you can no longer simply pass the geometry to some function, but must also pass jacobianTransposed, integrationElement, ... --- whatever you need.
Just to clarify: with option 4 I mean something like this:
auto jacInvT = geometry.jacobianInverseTransposed(); // <-- affine is checked here with result
                                                     //     stored in jacInvT, geometry
                                                     //     remains unmodified
auto jacInvTAtX = jacInvT(x); // <-- jacInvT remains unmodified, operator() is const
@martin.nolte if I understand you correctly, then you are talking about either this (I'll call it 4a):
auto jacInvT = geometry.jacobianInverseTransposed(); // <-- affine is not checked here
auto jacInvTAtX = jacInvT(x); // <-- affine is checked here, result is stored in jacInvT
                              //     with synchronization, operator() is const
or this (I'll call it 4b):
auto jacInvT = geometry.jacobianInverseTransposed(); // <-- affine is not checked here
auto jacInvTAtX = jacInvT(x); // <-- affine is checked here, result is stored in jacInvT
                              //     without synchronization, operator() is not const
@joe I actually meant your option 4:
auto jacInvT = geometry.jacobianTransposed(); // if affine, jacobianInverseTransposed is
// computed here
auto jacInvTAtX = jacInvT( x ); // in affine case: copy stored "matrix"
// otherwise compute inverse here