All 'generic' finite element implementations, i.e., those building on the infrastructure in dune/localfunctions/utility/, make heavy use of reinterpret_cast, breaking strict aliasing rules.
As a consequence, evaluating the basis functions leads to undefined behaviour.
For those not familiar with strict aliasing errors: a typical manifestation is that optimized code silently computes wrong numbers in certain cases, because the compiler makes incorrect assumptions about when register values have to be synchronized with memory.
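For illustration only (this is not code from dune-localfunctions), here is a minimal example of the kind of strict-aliasing violation that reinterpret_cast can introduce, and why the optimizer may then "forget" to reload a value from memory:

```cpp
#include <cstdio>

// Classic strict-aliasing violation: the compiler is allowed to assume that a
// float* and an int* never point to the same object, so under -O2 it may keep
// *f in a register and ignore the write through i.
float overwriteThroughAlias(float* f, int* i)
{
  *f = 1.0f;
  *i = 0;        // if i actually aliases f, this write may appear "lost"
  return *f;     // the optimizer may assume *f is still 1.0f
}

int main()
{
  float x = 0.0f;
  // Undefined behaviour: the int* really points at a float object.
  float r = overwriteThroughAlias(&x, reinterpret_cast<int*>(&x));
  std::printf("%f\n", r);  // may print 1.000000 instead of the expected 0.000000
}
```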
Sure, that's preferable. But someone has to do this. While I did some minor cleanups to this code recently, it still looks scary to me, and in my impression the offending reinterpret_casts are deeply tied to the whole code structure. Probably @andreas.dedner is the only one who could do this without some serious code archaeology.
For the record: at least for his hp-DG implementation, @lasse.hinrichsen observed that implementing a runtime order using type erasure is significantly faster than an implementation with a truly dynamic order. Note that the price of type erasure is paid here anyway, because it is already used to switch between topology types.
Using type erasure is of course only relevant when using a hybrid grid, which in combination with DG will not be the typical use case? Then again, once caching is employed....
No, he's using type erasure with respect to the order. In fact it's based on LocalFiniteElementVariant, which has the drawback that you need to fix the maximal order. On the other hand, there are many reports of std::visit being potentially faster than a virtual call.
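As a rough sketch of the "type erasure with respect to the order" idea (the types below are hypothetical stand-ins, not the actual LocalFiniteElementVariant interface): a std::variant over a set of statically compiled orders, dispatched with std::visit.

```cpp
#include <cstddef>
#include <stdexcept>
#include <variant>

// Hypothetical stand-in for a local finite element with compile-time order k.
template<int k>
struct FixedOrderElement
{
  static constexpr int order() { return k; }
  std::size_t size() const { return k + 1; }  // placeholder for the number of basis functions
};

// The maximal order must be fixed at compile time -- the drawback mentioned above.
using RuntimeOrderElement =
  std::variant<FixedOrderElement<1>, FixedOrderElement<2>,
               FixedOrderElement<3>, FixedOrderElement<4>>;

RuntimeOrderElement makeElement(int order)
{
  switch (order) {
    case 1: return FixedOrderElement<1>{};
    case 2: return FixedOrderElement<2>{};
    case 3: return FixedOrderElement<3>{};
    case 4: return FixedOrderElement<4>{};
    default: throw std::invalid_argument("order not compiled in");
  }
}

std::size_t basisSize(const RuntimeOrderElement& fe)
{
  // std::visit dispatches to the statically compiled implementation; this is
  // the call that is sometimes reported to be cheaper than a virtual function.
  return std::visit([](const auto& e) { return e.size(); }, fe);
}
```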
I had a look at the code, and all evil seems to originate from tensor.hh. The tensor implementation currently stores its data in FieldVectors held as members, and the casts are used to map between these types.
I think the best approach would be a proper interface definition of tensor and then a clean new implementation.
I also think that the ominous block() method should not be necessary, as it gives access to the internal data and does the reinterpretation.
My idea would be to introduce a lightweight wrapper for the tensors with entry points to the actual data. The sizes are template parameters, so these classes would really only consist of a pointer. The cost would be the same as for the reference we are currently handing out. If I understood everything correctly, the problem arises when going through the derivatives. As these need the initial coefficients anyway, we should be able to store those externally and then just work on this external storage.
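A minimal sketch of such a non-owning view, under the assumption that this is roughly what is meant (the name TensorView and its interface are made up for illustration, not an existing DUNE type): the extents are template parameters and the class holds nothing but a pointer into externally owned storage.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning tensor view: the extents are compile-time template
// parameters, the only data member is a pointer to external storage.
template<class F, std::size_t... extents>
class TensorView
{
  F* data_;

public:
  explicit TensorView(F* data) : data_(data) {}

  static constexpr std::size_t rank() { return sizeof...(extents); }
  static constexpr std::size_t size() { return (std::size_t(1) * ... * extents); }

  // Flat element access; multi-index access could be layered on top.
  F& operator[](std::size_t i) { return data_[i]; }
  const F& operator[](std::size_t i) const { return data_[i]; }
};

// Usage sketch: the coefficients live in one externally owned buffer, and
// views of it are created without any reinterpret_cast.
void example()
{
  std::vector<double> storage(3 * 4);  // e.g. 3 basis functions, 4 entries each
  TensorView<double, 3, 4> values(storage.data());
  values[0] = 1.0;
}
```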
@andreas.dedner I just tried out the type erasure approach for dynamic order k and tested it with the poisson-pq2 example in dune-functions. It turns out that incorporating the dynamic k into a LocalFiniteElementVariant-based cache did not change the runtime compared to a static k (as expected), while using the runtime order leads to an increase of ~10% in runtime when assembling a stiffness matrix for k=2.
Some more measurements: For order k=1 the overhead of the runtime version is about 20%. For k=3 it's almost equally fast and for k=4 the runtime version is even faster.
@christi Short answer: no. Long answer: I locally created a LocalFiniteElementVariant-based cache to implement dynamic k with underlying static-k implementations. I can provide it, but it's quite hacky and maybe not what you're looking for.
But if you're only interested in the (surprisingly) better behaviour of the fully dynamic implementation, this does not need the cache. You can simply see this by changing the order in poisson-pq2 to 4 and switching between the static and the fully dynamic implementation.
As far as I can see, the monomial-based dynamic implementation should be linear in the order, while I suspect that the Pk/Qk implementations are maybe not.
Just a remark (or perhaps I'm misunderstanding Carsten's test): the dynamic order is more of a by-product of the approach that Martin and I took; it was more about implementing the functionals and then having an automatic mechanism to construct the correct basis functions. Speed was never that much of an issue because we assumed the use of a quadrature cache anyway. Compile time improvement is of course another issue :-)
It is also not so easy to solve. I made one attempt but then stopped due to time limitations. I would leave this issue open as a reminder. Maybe with a proper tensor implementation plus slicing, spans and subspans, this can all be implemented very nicely. But that needs time.
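To make the "spans and subspans" remark a bit more concrete, here is a small sketch using C++20 std::span (the struct and its members are illustrative, not part of dune-localfunctions) of how value and derivative blocks could be sliced out of one flat coefficient buffer instead of reinterpreting its storage:

```cpp
#include <cstddef>
#include <span>
#include <vector>

// Illustration only: one flat buffer holds value and Jacobian coefficients for
// all basis functions; subspan() hands out views of the individual blocks
// instead of reinterpret_casting the storage to a different type.
struct EvaluationBuffer
{
  std::size_t numBasis, valueSize, jacobianSize;
  std::vector<double> data;

  EvaluationBuffer(std::size_t n, std::size_t v, std::size_t j)
    : numBasis(n), valueSize(v), jacobianSize(j), data(n * (v + j))
  {}

  std::span<double> value(std::size_t i)
  {
    return std::span(data).subspan(i * (valueSize + jacobianSize), valueSize);
  }

  std::span<double> jacobian(std::size_t i)
  {
    return std::span(data).subspan(i * (valueSize + jacobianSize) + valueSize,
                                   jacobianSize);
  }
};
```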
We could just add a comment in the code (as is done in dune-functions for the global bases that use these local functions, for example), but such comments are mostly forgotten. (There was recent news about a Python package for tar archives with a bug in utility functions that was never fixed; only a comment was added saying you should be careful. Now, 15 years later, about 350,000 Python packages have a security bug.)