`LeafNode::index` instead of `LocalView::index`?

Context: `dune-assembler`

During the last months, I have been re-implementing the PDELab function spaces against the dune-functions interface in a module called dune-assembler. The reasons to do so are various and not the point of this issue. The topic here is that during re-implementation I faced some performance problems related to the interface.

The dune-assembler module is largely unfinished and undocumented, so do not dig too much into its gory details.

Problem with `LocalView::index`

Now, the thing is that during benchmarking of factorized and vectorized local operators for DG convection-diffusion generated by dune-codegen (the fastest thing that we have around), I was faced with a major bottleneck on both the interface and the implementation of dune-functions. It could be that I am doing something stupid, but this is what I understand so far and why I claim this:

The interface in dune-functions requires the local view to provide a container index at the root node of the space. See LocalView concept check.
Such interface, forces the implementation to gather and store indices from all nodes regardless of the nature of the index. See Functions::DefaultLocalView implementation.
- For most function spaces, this is not a problem because the global pattern is irregular and storing them is the natural way to provide them back after bind().
- For DG spaces, the situation is different. The index is trivial to calculate because it is always an offset of the first index. This even holds after a for blocking and interleaving because our transformations happen from leaf-to-root (See for example post-child transformation in Functions::CompositePreBasis, PDELab::map_dof_indices_to_container_indices, or Assembler::EntityOrderingNode::firstContainerIndex). But such a trivial calculation is hard to realize from the LocalView without extra knowledge, so you just have to store the value.
In PDELab this is solved at bind-time, where the root node visitor takes only the first index of each leaf node. See PDELab::LeafLocalFunctionSpaceNode::bind. The problem with this approach is that the user needs to be aware of this and use a completely different assembler to ensure that the indexation does not lead to a segfault because other indices exist but are garbage. This is hard to see in PDELab, but is essentially the difference between the engines of PDELab::GridOperator and PDELab::FastDGGridOperator.
Thus, for the implementation of dune-assember I decided to deviate from the dune-functions interface and provide the index at the LeafNode, where each node is able to perform an on-the-fly computation or return a stored index depending of its knowledge about the space at that node. See implementation at leaf node Assembler::LeafNode::LocalView::index.

Benchmark

Now, my initial results seem to show that this reasoning is not far from reality. Functions and regular PDELab spaces are too slow to keep up with the speed of vectorized local operators and take much more time than the assembly time itself. The following is a figure that measures the time to assemble a residual for a vectorized local operator using different combinations of function spaces and assemblers with respect to the dune-functions interface.

Notes:

DA::FunctionsSpace: Function spaces from dune-assembler with leaf node index approach.
PDELab::FunctionsSpace: Function spaces from dune-pdelab with fast-dg optimization.
Functions::FunctionsSpace: Function spaces from dune-functions with root node index approach.
DA::Assembler: Assembler of dune-assembler with using dune-pdelab local operator interface.
PDELab::Assembler: Assembler of dune-pdelab with fast-dg optimization.
Regular PDElab function spaces + regular PDELab assembler are not shown but they are even slower than Functions::FunctionsSpace by an order of magnitude because it's doing many unnecessary things for DG.
Problem sizes are scaled to take approximately the same amount of DOF per problem so do not compare between different polynomial degrees.
I am not completely sure why the differences in time are so dramatic, but it certainly has to do with extra unnecessary memory pressure on the system and more time spend on the bind visitor.
It could be that I am doing something very silly with dune-functions, but my experience with PDELab and Assembler also points out that indexation at the root node (i.e. LocalView::index) is the underlying issue.
Compilers sometimes are able to see through the tree traversal and figure out the whole purpose of it, but in my experience, it happens so seldom that is better to actively act on this problem.

Proposal: `LeafNode::index` or `LocalView::index(path, node_dof)`

I do like the interface, but the LocalView::index is a major complication for me to attain to it. So I would like to propose an interface change that allows implementations to leverage the structure of DG layout:

(preferred) One form to do this is to move the index() function from LocalView to the LeafNode. I think this one is the most intrusive one but the one that gave me the best results in dune-assembler. Indeed, together with a LeafNode::path function, this was also the one that allowed me to seamlessly re-use PDELab local operators (like the ones generated by dune-codegen) with dune-functions function spaces.
Another way is to also pass the path to the desired leaf node: LocalView::index(path, node_dof). This seems to be easiest to implement in dune-functions but interface-wise looks more clumsy because node_dof may be easily confused with the localIndex of the root node. In this case, compatibility with PDELab local operators get more complicated.
Do you have any other ideas?

Edited Jun 23, 2022 by Santiago Ospina De Los Ríos

`LeafNode::index` instead of `LocalView::index`?

Context: dune-assembler

Problem with LocalView::index

Benchmark

Proposal: LeafNode::index or LocalView::index(path, node_dof)

Context: `dune-assembler`

Problem with `LocalView::index`

Proposal: `LeafNode::index` or `LocalView::index(path, node_dof)`