#1544 Make QuadratureRules::rule() thread-safe

The current implementation of the cache is a std::map. Possible fixes include

Using a Reader-Writer-Lock
Using RCU (read-copy-update)
Using a hierarchical data structure of vectors, indexed by geometry type, quadrature type, and quadrature order (up to a maximum). This needs call-once constructions to initialize a quadrature rule on demand.
Put the cache in thread-local storage.
Require explicit pregeneration of all needed quadrature rules.
means locking even to obtain the reader lock. Since quadrature rules are typically obtained once per element, this will typically be a bottleneck all threads have to go through for each mesh element.
Sounds great, in that there are implementations that put exactly zero overhead on a reader. But most information I could find was on the linux kernel implementation, which can employ tricks that are not feasible in a userspace implementation. Lastly, there may still be patent issues.
This needs preallocation of a lot of data. The call-once construction can be done with the C++11 facility for this, but the standard does not give any clue on how the performance is. Alternatively, we can initially store NULL pointers and enter a slow path for initialization later when we read a NULL pointer -- but there seems to be no guarantee that reading a pointer value is atomic.
This is currently done in dune-stuff, apparently. This may duplicate a lot of data in the case of many cores and p-adaptivity.
It can be difficult to figure out exactly what quadrature rules will be needed, in particular if p-adaptivity is involved. One way around this would be tell the QuadratureRules class to log every cache miss, and then use the log generate source code to seed the cache for production runs. This is quite fragile though, as production runs are typically larger, and may run into cases that are not seen during testing.

I'm going to try to implement 3. in exadune, lets see where that gets me.

3b) Use the hierarchical data structure just to store an atomic integer and a lock. The integer indicates the satus of the quadrature rule (0: not there, 1: begin constructed, 2: ready to use). The lock is used for locking during the construction only.

To obtain a quadrature we query the corresponding atomic. Let a be its value.

switch (a): { case 0: acquire lock; check atomic again and if its value is 2 release and return the rule set atomic to 1; create rule; set atomic to 2; release lock; return rule; case 1: wait until atomic is 2; case 2: return rule; }

Maybe we can even get rid of the locks and use the atomic for that. Then it would use the values 0 for not created, 1 for created and ready, 2+threadid for thread threadid is creating it.

@Markus: what you describe is basically std::call_once() and std::once_flag. I'd like not to reinvent the wheel, unless I can show that there is something seriously wrong with the existing ones.

The three patches contain a very preliminary version of this. This should explain what exactly I mean by hierarchical data structure.

TODO:

Check that all checks still pass that passed earlier.
Check whether -pthread or similar is needed, and if so, how to detect that without breaking cross-builds.
Validate that all relevant compilers support my C++11 constructs.
Check for performance regressions. In particular, also check on an architecture where synchronization/atomics may be expensive (ARMv7 or certain instances of POWER, see the talk by Herb Sutter http://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/).

Attachments

@joe Maybe I am missing something, but it seem like you have several subsequent calls of call_once with the call_once_flag when you are constructing the cache item. Doesn't this result in only the first call being invoked and the rest having not effect? Ergo I would assume that only the resize will happen and not the rest.

Do you really need the NoCopyVector? One might be able to abuse std::deque by using only emplace_back to prevent any attempts of copying. This might reduce the maintainance effort, but at the possible cost that the values are not in contiguous memory any more.

No, I'm using different once_flags for each level of the hierarchy. But I guess I should use more verbose names for each level of the hierarchy (instead of qt, g and p). I'll do that in a future iteration, that should avoid that confusion.

Indeed, it should be possible to use std::deque. I did overlook that deque::emplace_back() has relaxed requirements in comparison to vector::emplace_back(). In fact, it should even be possible to use deque::resize(n), since that default-constructs the new elements. The drawback is probably a larger constant in operator compared to std::vector or NoCopyVector, but lets see whether I can actually observe that first.

Here is the next version with better names and std::deque. Tests are currently failing, so this is definitely not final.

Attachments

The pthreads-linking issue has bitten again. Namely, one has to make sure to link with -pthread or similar, otherwise linking will work fine, but as soon as the program executes a thread function (or mutex, or call_once) strange things happen.

There are however ways to test whether thread functionality works at runtime. So what this patch-set does:

0001 intruduce a runtime check for call once (this check should be invoked by any code that uses call once to at least give a verbose error message instead of failing obscurely).
0002 Call this runtime check in a test in the test suite.
0003 Teach cmake to guess compiler and linker flags needed to get threads support, and to compile any program linkg to dune common with thos flags.
0004 Teach autotools to guess compiler and linker flags needed to get threads support, and to compile any program linkg to dune common with thos flags.

The autotools also does a check during configure whether these flags actually work by compiling and running a test program (this is skipped in the cross-compiling case).

Strictly speaking this is only needed in dune-geometry right now. I still put it in dune-common since the runtime check is pretty general and is likely to be used elsewhere.

@Christoph, Dominik: I'd like you input on the cmake stuff. In particular, whether I set the flags correctly, and whether there is a way to run a test program to see if the flags are working (unless we are cross compiling).

Attachments

This is the latest patch set for dune-geometry. The first to patches differ from the previous ones only in that they have beed rebased to the newest master. The third uses the facility from dune common from the previous post to make sure that call_once() works before using it.

Status of TODOs:

[DONE, works] Check that all checks still pass that passed earlier.
[PENDING Discussion] Check whether -pthread or similar is needed, and if so, how to detect that without breaking cross-builds.
[TODO] Validate that all relevant compilers support my C++11 constructs.
[TODO] Check for performance regressions.

Attachments

I have only read your patched, I did not apply and test them.

== 0002-Put-a-check-for-... == Looks good. Just about terminology: I think a check is performed by configure, I'd call it a test. I don't have sources to back up my claim. == 003-threads-cmake-guess-whet... ==

Why do you introduce a new FindSTDThread.cmake? Isn't it a C++11 feature? I'd check it there.
I'd prefer to have *StdThread.cmake instead of *STDThread.cmake because the former is better readable.
"if(NOT DEFINED STDTHREAD_FOUND)" is not needed, all command within the branch work in both cases.
set(STDThread_FOUND ${THREADS_FOUND}) don't do that by yourself, this variable is generated by find_package_handle_standard_args (FPHSA), use THREADS_FOUND as required variable in FPHSA
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${STDTHREAD_COMPILE_FLAGS} ") Have you tried this part? Did it work? I am not sure what happens if you re-run configure. Will it add the flag each time? And does it work at all, some people claim it does not: http://stackoverflow.com/questions/15100351/changing-cmake-cxx-flags-in-project

Overall your code seems ok.

OK, here ist the new version of the cmake patch (0003-...). The other three dune-common patches are unchanged.

I figured out how to run a test program (but only if we are compiling natively). Also, I adressed some of your points.

Why do you introduce a new FindSTDThread.cmake? Isn't it a C++11 feature? I'd check it there.

No strong feelings, so I put it into CheckCXX11Features.cmake instead as you suggest. This also makes a lot of your other points moot, since I no longer implement my own package.

I'd prefer to have *StdThread.cmake instead of *STDThread.cmake because the former is better readable.

Moot.

"if(NOT DEFINED STDTHREAD_FOUND)" is not needed, all command within the branch work in both cases.

That construct was meant to avoid overwriting variables set in the cache.

set(STDThread_FOUND ${THREADS_FOUND}) don't do that by yourself, this variable is generated by find_package_handle_standard_args (FPHSA), use THREADS_FOUND as required variable in FPHSA

?? I got that usage from the example in the documentation of FPHSA. Anyway, moot.

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${STDTHREAD_COMPILE_FLAGS} ") Have you tried this part? Did it work? I am not sure what happens if you re-run configure. Will it add the flag each time? And does it work at all, some people claim it does not: http://stackoverflow.com/questions/15100351/changing-cmake-cxx-flags-in-project

I did try this and it worked. This is also how "-std=c++11" is added at the top of CheckCXX11Features.cmake. Anyway, since find_package(Threads) does not set any compiler flags I removed this part.

This leaves the linker flags in CMAKE_EXE_LINKER_FLAGS. This does work at build time, if you link your executable to libdune-common. It does not appear to work at config time, contrary to CMAKE_CXX_FLAGS, so I had to set CMAKE_REQUIRED_LIBRARIES instead for compiling the sanity check.

Attachments

0003-threads-cmake-Guess-the-flags-needed-to-get-working-.patch

Looks ok so far. In general, CMake checks less thoroughly compared to Autotools. If a header or a library are found by the filename, that's considered enough. Sometimes they check for a specific symbol. But they often do not compile or try to link like many Autotool checks from Dune do. The CMake authors claim that users are able to handle these situations and test do not need to be over-protective. I am not sure whether you really need the run check.

Well, the problem with the missing pthreads is that the error you get is extremely non-obvious. It took me days to find out what the problem was. And it is easy to run into that problem.

Anyway, the testsuite-test needs to avoid lamdas. Here is a patch on top of the other dune-common patches to take care of that.

Attachments

0005-threads-tests-Don-t-use-lambdas-in-calloncectest-for.patch

These patches go on top of the previous dune-geometry patches. They fix issues with g++-4.4.

One issue is that g++-4.4's std::deque requires a copy or move constructor from the elements, so I reintroduce NoCopyVector.

The other issue is that g++-4.4 does not support lambdas.

Attachments

performance:

on x86_84, my laptop, I get for "time ./test_quadrature": 3.007s (sdev: 0.006164s) for the old code and 2.9955s (sdev 0.011210s) for the new code. mean over for measurements, "real" time.

performance:

on armhf (ARMv7), a chromebook, I get for "time ./test_quadrature": 17.2605s (sdev 0.045273s) for the old code and 16.7155s (sdev 0.007724s) for the new code. mean over for measurements, "real" time.

BTW, the test on x86_64 used g++-4.7.2 from debian wheezy, the one on armhf used g++-4.9.1 from debian jessie. Both tests used -g -O3.

I still have to make sure there is no bad performance degradation when I actually use multithreading. Singlethreading looks roughly OK. I need to make sure though that test_quadrature is actually a good benchmark; it might be the case that it requests each quadrature rule exactly once -- which would mean I haven't tested cache hits.

OK, here is a quite hard test to check the scalability of the initial cache miss. This test starts a number of threads, each thread running in the same order through all available quadrature rules up to and including dim=3, and requesting each rule. This should be a pretty hard test since all threads will most of the time run into the situation where they have to wait for the first thread to finish initialization, so they all have to spin on the same once_flag.

I did a weak scalability test on a xeon phi for this. There is a noticable slowdown as I use more threads, but even with 240 threads this is less than a factor of two. Since this is basically setup time, I think this is acceptable.

Attachments

0007-NO-PUSH-QuadratureRules-Benchmark-Problem-for-the-qu.patch

I created a feature branch feature/fs1544-threadsafe-quadrature (in dune-geometry), which contains my code proposal. This should be the same as the last set of patches posted here. I consider is stable enough that I don't expect no more rebasing.

I extended the benchmark a bit an ran it systematically. The results are in tests.org, I'll discuss them below. The scripts and logs are in benchmark.tar.gz. The benchmark program is in the branches p/joe/fs1544-threadsafe-quadrature-benchmark-old (for testing the current dune code) and p/joe/fs1544-threadsafe-quadrature-benchmark-new (for testing my proposal).

I benchmarked four scenarios with different compilers on different architechtures. These scenarios were

setup: using all quadrature rules ones initially;
all: using all quadrature rules multiple times with a preseeded cache;
cg: using the order 1 legendre quadrature rules on cubes multiple times with a preseeded cache; and
dg: using the order 1 legendre quadrature rules on cubes and each face of the cube multiple times with a preseeded cache. Here by "using a quadrature rule" I mean summing up the weights. These summed-up weights are passed out of the thread and displayed with the resulting timings to ensure the compiler's optimizer cannot possibly eliminate this use.

For "setup" and sequential operation, the new code and the old code have approximately the same performance. Sometimes the times deviate by up to ~30%, but since this is setup-time I consider this acceptable. For multithreaded operation I cannot compare with the old code, but I can look at the weak scalability. What I want to look out for here is worse-than-linear behaviour in the number of threads, which could potentially lead to unacceptable runtimes even for setup.

For the Xeon Phi I have a weak efficiency of 78% up to 60 threads, and still 60% at 240 threads (which means 4x hyperthreading). This is quite excellent, actually.
For amd64 (Core i5) I get efficiencies between 60% and 90% for two threads and 60% and 73% for four threads, depending on compiler. This is not so great, but in all cases better than linear.
For armhf (Cortex-A15) I get efficiencies between 36% and 53% for two threads, depending on compiler. This is quite worrysome, but then, ARMv7 is known to have expensive threading primitives. Also, there is considerable jitter in the timings here.

For the remaining tests any differences in timings between old an new code should not be the result of locking primitives, since threads not longer need to coordinate (as long as the implementation of std::once_flag is reasonable). Also, I always preseed the cache, so I can do the multithreaded benchmarks even with the old code.

For "all" the times are indeed very similar for the old and the new code across all architectures.

For "cg" and "dg" on amd64 the new code is always faster, in the extreme case by a factor of two. On the Xeon Phi the new code is also the faster one, although the factor here is more like 1.2 to 1.3. On armhf the story is weird. Old and new timings are within a factor of 1.25 of each other, but it depends on the compiler whether the new or the old code is the slower one.

Attachments

I guess this is a bit out of place, but this caching (which happens similarly for reference elements by the way) has worried me a few times. So I have the following suggestion:

We could create a new class QuadratureFactory (name open for discussion), which will never cache the quadratures but really create them (returning them either by value or as a unique_ptr.

Then, the QuadratureRules cache could become a utility-class. The user can choose to use it, but he could also use another implementation. This way, we could provide two default caches: QuadratureRules as it is and ThreadSafeQuadratureRules (names open for discussion). Now the user can choose which implementation he likes most.

There is a caveat, however. This way, all "library" code should no longer choose the QuadratureRules implementation itself but should ask the user to provide it (e.g. through an extra template argument). Otherwise, the user might end up having two caches, which is undesirable.

PS: As dune-fem does not use these quadrature rules, I will not object to the change.

At least with regard to thread safety I can report that the reference elements don't suffer the same problem as the quadrature rules.

They also use a cache (per dimension and per ctype), but this cache is a block-scope static variable and is completely filled the first time it is used. This is thread safe in C++11, and (I believe) in many g++ versions long before C++11.

@joe: You are right, the reference elements should be thread-safe enough. However, I was not referring to thread safety in this case. With the reference elements, the problem is actually that all of them are built on first access. In 10d, this means 1024 reference elements are initialized (approx. 2G of memory), even if I use only a cube grid. This agressive caching might be a good idea for dim <= 3, but not for all space dimensions.

In essense, I simply think it is bad DUNE style to force the user into one cache implementation. But I guess this is a separate issue and does not belong here. Sorry for the noise.

I have no strong opinion on the entire issue here, so please Jö do what you think is necessary.

I do agree with Martin though that from a design standpoint it would be nicer to separate the factory from the cache.

I just checked your common and geometry branches with CMake 3.1 and GCC 5. There were no warnings. No objections from my side. I guess a feature was rarely measured such scrupulous as this one. Thanks, Jö!

IMHO it's likely that people create custom rules that they want to cache on their own (we do this in dune-fufem). In view of this I also like the idea of splitting factory and cache. However this seems to be orthogonal to the thread safety discussed here.

To decouple the "alternate caches" discussion, I opened two feature requests: #1555 for quadrature rules and #1556 for reference elements.

I just asked Carsten in his role as realease manager, he does not see a problem with including this in 2.4.

I started a vote on whether to include this in 2.4 (http://users.dune-project.org/doodles/28), which ended two days ago. The outcome was: "Include in 2.4": 10 votes "Include in 2.4"/"Reject"/"Further Discussion": 1 vote not voting: 2 people

I just checked with Steffen whether he sees any conflicts with his work ongoing work on killing entitypointers. As expected, he does not see any, so I'm gonna go ahead and merge.

Merged in dune-common 911f9b095eb2c2f0dc4b5bac5a8aa2147127b8f8 and dune-geometry b661eaea75f4ea6a7d64304f50e11ecc20894c41. I ran a quick make check in dune-common, dune-geometry and dune-grid (3ff293efbe2d7ee1a4a4f24ccfadf74150f57d2e) to make sure that all tests behave as expected, which they do. That is the good news.

The bad news is that markus already discovered an issue with the use of nullptr, see #1563 (closed). I'll have a look.

I'm pruning the branches from the repository. The two benchmarking branches will be lost; to recreate them apply the patch attached here to dune-geometry 6f10aebba547f3ee80d47b85ad5a4af749a90b79 (for p/joe/fs1544-threadsafe-quadrature-benchmark-old) or fa6aa086b662fbd08746e13c3364bf978721e452 (for p/joe/fs1544-threadsafe-quadrature-benchmark-new).

Attachments

0001-NO-MERGE-QuadratureRules-Benchmark-Problem-for-the-q.patch

The bug reported by Markus was apparently transient, so I'm declaring this one done. Closing.

Status changed to closed

mentioned in issue #1555

mentioned in issue #1556

Property	Value
Reported by	Jö Fahlke (jorrit@jorrit.de)
Reported at	Dec 11, 2014 21:19
Type	Feature Request
Version	2.3
Operating System	Unspecified / All
Last edited by	Jö Fahlke (jorrit@jorrit.de)
Last edited at	Feb 12, 2015 12:48
Closed by	Jö Fahlke (jorrit@jorrit.de)
Closed at	Feb 12, 2015 12:48
Closed in version	Unknown
Resolution	Fixed
Comment

#1544 Make QuadratureRules::rule() thread-safe

Metadata

Description

Designs

Child items ...

Activity

Attachments

Attachments

Attachments

Attachments

Attachments

Attachments

Attachments

Attachments

Attachments

Attachments