# Input

The LICM algorithm is implemented as a series of loopy kernel transformations. The input loopy kernel has already been
vectorized and other transformations, like constant folding (e.g. `2 * 4 -> 8`

), have been applied. The kernel is
preprocessed and scheduled before the actual algorithm. If the input kernel already has a schedule, this is a no-op.

# Algorithm

As a preliminary step, the kernel instructions are split into instructions, which can be modified through CSE,
`Assignment`

and `CallInstruction`

, and which cannot, everything else, but most importantly `CInstructions`

. The latter
instructions are ignored during the rest of the algorithm and added the resulting kernel.

The algorithm consists of the following steps:

- determine minimal iname set for all sub expressions
- use loopy to generate a schedule for the kernel
- deduce iname nesting from schedule
- valid scope of temporaries

- rewrite each sub expression to group expression with the same minimal iname set
- update minimal iname sets

- compute cost of each sub expression within its instruction
- build sub expression graph of cses to extract
- build new kernel
- precompute cses
- add cses to schedule
- add cses to original instructions

## Minimal Iname Sets

The first step is to determine for all sub expression the minimal set of inames it depends on. A sub expression denotes a node in the pymbolic AST. This needs additional information, which is not present in the input kernel. The first one is the nesting of the loop domains and the second is the scope of temporary variables.

The nesting of the loop domains is simply given from loopy. It should be noted, that the computation of the schedule is extremely slow compared to the rest of the cse algorithm, but it is a necessary step in the code generation process, even if LICM is not applied. As a result, the LICM transformation can take up a significant (>50%) part of the runtime.

If an sub expression contains a temporary variable its inames must be nested within the valid scope of the temporary. For a given temporary its valid scope is the iname set which initializes the variable, discarding possible indexing inames of the temporary. As an example, in the most common case of the value or gradient of the trial function with the following loop nest

```
for q
for ex, ey
u = reduction((ix, ey), phi[q, (ix, iy)] * x[ix, iy, ex, ey])
for d
grad_u[d] = reduction((ix, ey), phi[q, d, (ix, iy)] * x[ix, iy, ex, ey])
```

the valid scope of both variables is the set `q, ex, ey`

. Other examples for temporaries may include the macro-element
corners, which are initialzed completly outside of the quadrature loop and therefore their valid scope is the entire
kernel.

With these additional informations the inames a sub expression depends are determined. This is done in two steps, first
the precidates of all instructions are considered, and then, with this information, the expression of all instructions
are considered. Predicates usually appear in the context of boundary kernels to distinguish between different parts of
the boundary. If an instruction has a predicate, all sub expression of its right hand side need to be nested within at
least the inames of the predicate. In both steps, the collecting of the iname sets is done through a depth first AST
tree traversal and merging the sets of the children nodes to compute the iname set of the parent node. The iname set for
each node is saved into a dict. For the tree traversal, in addition to the usual leaf nodes of a pymbolic AST, function
nodes are considered leaf nodes, so that these nodes are only considered as a whole in the following steps. The
traversal is implemented as a pymbolic visitor (also called mapper), inheriting from the `CombineMapper`

, which already
implements the base functionality.

## Rewriting Expressions

In the next part of the algorithm, these iname sets are used to rewrite the expression of all instructions to mirror the
nesting of the iname sets. This allows, later on, an easier extraction of nodes. The rewriting is implemented again as a
depth first tree traversal, extending existing pymbolic mappers. The approach here is to first split the children of a
node into groups of nodes with the same iname sets. If the number of groups is `> 1`

and `< #children`

, the groups are
sorted according to the length of their corresponding iname sets. This is done because of the assumption that the iname
sets are perfectly nested, i.e. the sets can be sorted such that `is_1 <= is_2 <= is_3 ...`

. If this assumption does not
hold an exception is raised and the LICM algorithm is aborted. In that case a fallback CSE algorithm without code motion
is used. This case usually appears in jacobian kernel, since the `i, j`

loop of the matrix index has no obvious
nesting. If the assumption holds, the groups are separatly processed and the results are chained together to recreate
the nesting of the iname sets. For example, a expression `a + b + c + d`

may be refactored as `a + ((b + c) + d)`

,
meaning that `a`

is contained in the smalles (least constrained) iname set, `b, c, d`

all have larger iname sets and
`b, c`

are within the same iname set. Currently this rewriting is only applied to `sum`

and `product`

nodes. For these
new expression the minimal iname sets have to be recomputed.

With the rewritten expressions, the cost of each sub expression, with respect to the instruction it is contained in, is computed. The cost is defined as the size of the iname set of the instruction without the minimal iname set of the sub expression, i.e. the size of the superfluous iname set. The actual size of that set is computed using isl. If the iname set contains parameterized inames (inames without compile time bounds) the parameter of that iname is fixed to two, which may be not reasonable in case of DG0 basis functions. As before, this is implemented as a tree traversal algorithm and in addition to the cost of each sub expression the instruction id which contains the sub expression is cached.

## Building Sub Expression DAGs

The set of all sub expression with cost `> 1`

is preprocessed to build a dependency graph between sub expression,
i.e. for each sub expression which sub expression it contains and vice versa, and filter unnecessary sub expressions
out. The graphs used for this purpose are realized through dicts with nodes as keys and edges as values. The starting
point is a directed acyclic graph representing the parent-child relation for each sub expression, i.e. the children of a
node are the children of the parent sub expression. The graph is build through tree traversal, similar to the earlier
cases. During the traversal, sub expressions which can appear outside any other sub expression are cached as top level
sub expression. It should be noted that a top level sub expression may still appear within other sub expressions, and
are therefor not simply the roots of the graph. Only certain pymbolic expressions may appear as node in this graph,
currently these are `sum, product, quotient, call`

are considered. Other pymbolic expressions, which also contain some
notion of children expression may be added if they are needed.

The parent-child DAG is inverted to create a child-parent DAG, which represents for each sub expression all sub expression which directly include the first one. This DAG is needed to add the precomputed sub expressions in a bottom-up fashion later on. Both DAG are reduced to eliminate nodes with children which is nested in the same iname set as the parent node. This is realized through a breadth-first traversal of the child-parent DAG and creates a new parent-child DAG. It could be modified to only eliminate nodes with only one child which is also in the same iname set, but in practice this approach resulted in a high number of temporaries. My guess is that the c++ compiler is better equiped to decide how CSE should be handled within each loop level, since it probably also consideres other factor than the sole number of appearences, like number of in-use registers. The new reduced child-parent DAG is computed by inverting the reduced parent-child DAG.

As an intermediate step a mapping from sub expression to name and id for its assignment instruction is created. Then a new DAG, mirroring the parent-child DAG, is created, where each sub expression is replaced by its name and id.

## Build New Kernel

With the transformed parent-child DAG and the mapping from sub expression to names and ids the sub expressions can be
precomputed. This is implemented as a breadth-first tree traversal of the inverse DAG of the parent-child DAG. The
precomputation results in new instructions, a replacement dict for the corresponding sub expressions and a dependency
DAG, capturing depencencies for each precomputation on both other precomputations and existing instructions. If a sub
expression appears in multiple instructions with different predicates the all predicates are combined with `or`

for the
precomputation instruction. Special care needs to be applied if the precomputed sub expression depends on a vectorized
iname. If this is the case, the shape and the dimension tags must be set explicitly according to the vectorization
width, since the kernel is already preprocessed.

Next, the new schedule is computed, using the dependency DAG from before. The schedule index for each precomputation must satisfy the following conditions, the precomputation must be scheduled after the maximal index of existing instructions it depends on, it must be scheduled after the precomputation of new cse temporaries it depends on, it must be scheduled before the minimal index of existing instructions, which depend on it, and it must be scheduled within the correct inames. The condition of the scheduling before instructions, depending on the new precomputation, can be easily satisfied by using the minimal schedule index, which satisfies the remaining three conditions. The new schedule is created by determining for each precomputation the schedule index satisfying all conditions and then inserting each new schedule item at the coresponding position.

To ensure that for each precomputation all other new precomputations, the first one depends on, are already scheduled, the inverted dependency DAG is traversed in a breadth-first manner, with the modification that new nodes are added to the queue only if all children of the new node have been handled already. The maximal index of the existing instructions is trivially found, but the index of the precomputation is more difficult to find, since the schedule is updated only after all new schedule indices have been found and therefore precomputations, for which the index has already been decided, are not part of the schedule. To circumvent this problem, the indices are gathered in an ordered dict and therefore if the highest dependency index stems from a new precomputation this can be used as the new schedule index, since the order is preserved during the insertion.

The last step is to transform the instruction containing a precomputed sub expression using the replacement dict from before. The replacement mapper used for this differs in one detail from the standard pymbolic replacement mapper, which is that it doesn't flatten products or sums, since the rewritten expressions from earlier need to be left intact for the replacement dict to work. Combining these instructions with the precomputation instructions, the ignored instructions from the beginning, and the new schedule results in the finished kernel.

If for some reasons, the kernel does not have a schedulable loop nesting, the new kernel is discarded and the fallback implementation without code motion is used. This is a remnant of an earlier implementation and should not happen anymore.

# Shortcomings

- Nested sub expressions can only be handled in an all or nothing approach, either all nested sub expressions are precomputed if they appear more than once, regardless of the loop nests, or only nested sub expressions, which can be hoisted out of the parent loop nest, are precomputed. An approach inbetween would need to estimate somehow how many and which precomputions are reasonable.
- Although this situation never appeared, this approach probably cannot handle one cse within multiple instructions with different iname sets.
- This can only precompute scalar expressions. In some cases it would be beneficial to precompute multidimensional expressions, e.g. scaled basis gradient, but this is currently not implemented.