Improve performance of assembleGlobalBasisTransferMatrix
The old approach used a std::map and compared coordinates
in the innermost loop to avoid duplicate evaluations
of LocalBasis. Instead, the new approach determines the
interpolation points first and caches the values in advance,
based on the evaluation order. This improved the performance
significantly.
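
The caching idea can be sketched roughly as follows. This is only an
illustration, not the code from this commit: CountingBasis and
localTransfer are hypothetical names, and a 1D linear basis stands in
for the real LocalBasis. The values are computed once per interpolation
point and later read by position, so no map lookup or coordinate
comparison is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a LocalBasis: 1D linear basis on [0,1].
// The call counter only exists to make the saved evaluations visible.
struct CountingBasis {
  mutable int calls = 0;
  std::vector<double> evaluate(double x) const {
    ++calls;
    return {1.0 - x, x};
  }
};

// Entry (i,j) is coarse function j at the interpolation point of
// fine degree of freedom i.
std::vector<std::vector<double>> localTransfer(
    const CountingBasis& coarse, const std::vector<double>& points)
{
  // Evaluate once per point, up front, in the order the points are visited.
  std::vector<std::vector<double>> cache;
  cache.reserve(points.size());
  for (double x : points)
    cache.push_back(coarse.evaluate(x));

  // Loop over coarse functions as an interpolation would, but read
  // cache[i] by position instead of evaluating the basis again.
  const std::size_t nCoarse = cache.empty() ? 0 : cache[0].size();
  std::vector<std::vector<double>> matrix(
      points.size(), std::vector<double>(nCoarse));
  for (std::size_t j = 0; j < nCoarse; ++j)
    for (std::size_t i = 0; i < points.size(); ++i) {
      assert(cache[i].size() == nCoarse);
      matrix[i][j] = cache[i][j];
    }
  return matrix;
}
```

In this sketch the basis is evaluated once per point rather than once
per point and coarse function.
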
This also removed the tracking of already processed indices
in a std::unordered_set, since in all tested combinations
the tracking turned out to be slower than recomputing the entries.
Once we have a utility to generically create a suitable nested
bit-vector type for a basis, we can reintroduce this optimization,
because tracking with such a type would indeed improve the performance.
Furthermore, this cleans up the includes and removes some wrappers
that are no longer used. Despite being implementation details, the
function and geometry wrappers were not in an Impl:: namespace.
Hence there's a small possibility that someone used them
outside of dune-fufem.