|
|
# Input
|
|
|
The LICM algorithm is implemented as a series of loopy kernel transformations. The input loopy kernel has already been vectorized, and other transformations, such as constant folding (e.g. `2 * 4 -> 8`), have been applied. The kernel is preprocessed and scheduled before the actual algorithm runs; if the input kernel already has a schedule, this is a no-op.
|
|
|
|
|
|
|
|
|
# Algorithm
|
|
|
As a preliminary step, the kernel instructions are split into those that can be modified through CSE, namely `Assignment` and `CallInstruction`, and those that cannot, i.e. everything else, most importantly `CInstruction`s. The latter are ignored during the rest of the algorithm and added back to the resulting kernel unchanged.
|
|
|
|
|
|
The algorithm consists of the following steps:

- determine the minimal iname set for all sub expressions
- use loopy to generate a schedule for the kernel
- deduce the iname nesting from the schedule
- determine the valid scope of temporaries
- rewrite each sub expression to group expressions with the same minimal iname set
- update the minimal iname sets
- compute the cost of each sub expression within its instruction
- build the sub expression graph of CSEs to extract
- build the new kernel
- precompute the CSEs
- add the CSEs to the schedule
- replace the CSEs in the original instructions
|
|
|
|
|
|
## Minimal Iname Sets
|
|
|
The first step is to determine for each sub expression the minimal set of inames it depends on. A sub expression denotes a node in the pymbolic AST. This requires additional information which is not present in the input kernel: first, the nesting of the loop domains, and second, the scope of temporary variables.
|
|
|
|
|
|
The nesting of the loop domains is obtained directly from loopy. It should be noted that the computation of the schedule is extremely slow compared to the rest of the CSE algorithm, but it is a necessary step in the code generation process even if LICM is not applied. As a result, the LICM transformation can take up a significant part (>50%) of the runtime.
|
|
|
|
|
|
If a sub expression contains a temporary variable, its inames must be nested within the valid scope of the temporary. For a given temporary, its valid scope is the iname set under which the variable is initialized, discarding possible indexing inames of the temporary. As an example, consider the most common case, the value and gradient of the trial function, with the following loop nest
|
|
|
|
|
|
```
for q
  for ex, ey
    u = reduction((ix, iy), phi[q, (ix, iy)] * x[ix, iy, ex, ey])
    for d
      grad_u[d] = reduction((ix, iy), phi[q, d, (ix, iy)] * x[ix, iy, ex, ey])
```
|
|
|
|
|
|
the valid scope of both variables is the set `q, ex, ey`. Other examples of temporaries include the macro-element corners, which are initialized completely outside of the quadrature loop; their valid scope is therefore the entire kernel.
|
|
|
|
|
|
With this additional information, the inames a sub expression depends on are determined. This is done in two steps: first the predicates of all instructions are considered, and then, with this information, the expressions of all instructions. Predicates usually appear in the context of boundary kernels to distinguish between different parts of the boundary. If an instruction has a predicate, all sub expressions of its right hand side need to be nested within at least the inames of the predicate. In both steps, the iname sets are collected through a depth-first AST traversal, merging the sets of the child nodes to compute the iname set of the parent node. The iname set for each node is saved in a dict. For the tree traversal, in addition to the usual leaf nodes of a pymbolic AST, function nodes are considered leaf nodes, so that these nodes are only treated as a whole in the following steps. The traversal is implemented as a pymbolic visitor (also called a mapper), inheriting from the `CombineMapper`, which already implements the base functionality.
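The collection step can be sketched in plain Python. Nested tuples stand in for pymbolic nodes and `leaf_inames` for the iname and scope information from the kernel; all names here are illustrative, not the actual implementation:

```python
# Sketch of the bottom-up iname set collection, mimicking a pymbolic
# CombineMapper: the set of a parent node is the union of its children's
# sets, and every node's set is cached in a dict.

def iname_set(expr, leaf_inames, cache):
    """Return the minimal iname set of `expr`, recording it in `cache`."""
    if expr in cache:
        return cache[expr]
    if isinstance(expr, str):  # leaf: variable, temporary or function node
        inames = frozenset(leaf_inames.get(expr, ()))
    else:  # interior node ("sum", child, ...): combine the children's sets
        _op, *children = expr
        inames = frozenset().union(
            *(iname_set(c, leaf_inames, cache) for c in children))
    cache[expr] = inames
    return inames

# hypothetical leaf information for a quadrature-style expression
leaf_inames = {"phi": ("q", "ix"), "x": ("ix", "ex"), "w": ("q",)}
expr = ("sum", ("product", "phi", "x"), "w")
cache = {}
iname_set(expr, leaf_inames, cache)
# cache now holds the minimal iname set of every sub expression
```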
|
|
|
|
|
|
## Rewriting Expressions
|
|
|
In the next part of the algorithm, these iname sets are used to rewrite the expressions of all instructions to mirror the nesting of the iname sets. This allows an easier extraction of nodes later on. The rewriting is again implemented as a depth-first tree traversal, extending existing pymbolic mappers. The approach is to first split the children of a node into groups of nodes with the same iname sets. If the number of groups is `> 1` and `< #children`, the groups are sorted according to the length of their corresponding iname sets. This relies on the assumption that the iname sets are perfectly nested, i.e. the sets can be sorted such that `is_1 <= is_2 <= is_3 ...`. If this assumption does not hold, an exception is raised and the LICM algorithm is aborted; in that case a fallback CSE algorithm without code motion is used. This case usually appears in Jacobian kernels, since the `i, j` loops of the matrix index have no obvious nesting. If the assumption holds, the groups are processed separately and the results are chained together to recreate the nesting of the iname sets. For example, an expression `a + b + c + d` may be refactored as `a + ((b + c) + d)`, meaning that `a` is contained in the smallest (least constrained) iname set, `b, c, d` all have larger iname sets, and `b, c` share the same iname set. Currently this rewriting is only applied to `sum` and `product` nodes. For these new expressions the minimal iname sets have to be recomputed.
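A minimal sketch of this grouping step, again with tuples standing in for pymbolic nodes; `regroup_sum` and `NonNestedError` are illustrative names, not the actual API:

```python
# Sketch of regrouping a sum's children by minimal iname set. Groups are
# sorted by set size, checked for perfect nesting, and chained into a
# right-nested sum from the smallest set outwards.

class NonNestedError(Exception):
    """Raised when the iname sets are not perfectly nested."""

def regroup_sum(children, iname_sets):
    groups = {}
    for c in children:
        groups.setdefault(iname_sets[c], []).append(c)
    if not 1 < len(groups) < len(children):
        return ("sum",) + tuple(children)  # nothing to regroup
    keys = sorted(groups, key=len)
    for small, large in zip(keys, keys[1:]):
        if not small <= large:  # assumption: is_1 <= is_2 <= ...
            raise NonNestedError((small, large))
    def as_node(group):
        return group[0] if len(group) == 1 else ("sum",) + tuple(group)
    node = as_node(groups[keys[-1]])  # group with the largest iname set
    for key in reversed(keys[:-1]):   # wrap towards the smallest set
        node = ("sum", as_node(groups[key]), node)
    return node

# the example from the text: a has the smallest set, b and c share a larger
# set, d has the largest set -> a + ((b + c) + d)
sets = {"a": frozenset({"q"}),
        "b": frozenset({"q", "ex"}), "c": frozenset({"q", "ex"}),
        "d": frozenset({"q", "ex", "d"})}
out = regroup_sum(["a", "b", "c", "d"], sets)
# out == ("sum", "a", ("sum", ("sum", "b", "c"), "d"))
```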
|
|
|
|
|
|
With the rewritten expressions, the cost of each sub expression is computed with respect to the instruction it is contained in. The cost is defined as the size of the iname set of the instruction without the minimal iname set of the sub expression, i.e. the size of the superfluous iname set. The actual size of that set is computed using isl. If the iname set contains parameterized inames (inames without compile-time bounds), the parameter of such an iname is fixed to two, which may not be reasonable in the case of DG0 basis functions. As before, this is implemented as a tree traversal, and in addition to the cost of each sub expression, the id of the instruction containing the sub expression is cached.
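The cost computation can be illustrated without isl: a plain dict of compile-time bounds stands in for the isl point count, with parameterized inames (bound `None`) fixed to two as described above. The bounds and names are hypothetical:

```python
# Sketch of the cost computation: the superfluous inames are those of the
# instruction minus those of the sub expression, and the cost is the number
# of iterations of that set. The real implementation counts with isl.

def cost(insn_inames, expr_inames, bounds):
    size = 1
    for iname in insn_inames - expr_inames:
        size *= 2 if bounds[iname] is None else bounds[iname]
    return size

bounds = {"q": 4, "ex": None, "ey": None, "d": 3}  # hypothetical bounds
# a sub expression depending only on q, inside a (q, ex, ey, d) instruction:
print(cost({"q", "ex", "ey", "d"}, {"q"}, bounds))  # 2 * 2 * 3 = 12
```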
|
|
|
|
|
|
## Building Sub Expression DAGs
|
|
|
The set of all sub expressions with cost `> 1` is preprocessed to build a dependency graph between sub expressions, i.e. recording for each sub expression which sub expressions it contains and vice versa, and to filter out unnecessary sub expressions. The graphs used for this purpose are realized through dicts with nodes as keys and edges as values. The starting point is a directed acyclic graph representing the parent-child relation of the sub expressions, i.e. the children of a node are the child sub expressions of the parent sub expression. The graph is built through a tree traversal, similar to the earlier cases. During the traversal, sub expressions which can appear outside any other sub expression are cached as top level sub expressions. It should be noted that a top level sub expression may still appear within other sub expressions, so they are not simply the roots of the graph. Only certain pymbolic expressions may appear as nodes in this graph; currently `sum`, `product`, `quotient`, and `call` are considered. Other pymbolic expressions which also contain some notion of child expressions may be added if they are needed.
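A sketch of the dict-based DAG, with tuples standing in for pymbolic nodes (the names are illustrative); `invert` produces the child-parent DAG described in the next paragraph:

```python
# Sketch of the parent-child DAG: keys are sub expressions (tuples), values
# are the CSE-eligible sub expressions appearing directly inside them.

CSE_NODES = {"sum", "product", "quotient", "call"}

def build_parent_child_dag(expr, dag=None):
    if dag is None:
        dag = {}
    if isinstance(expr, str) or expr in dag:
        return dag
    _op, *children = expr
    if _op in CSE_NODES:
        dag[expr] = [c for c in children
                     if not isinstance(c, str) and c[0] in CSE_NODES]
    for c in children:
        build_parent_child_dag(c, dag)
    return dag

def invert(dag):
    """Create the child-parent DAG from the parent-child DAG."""
    inv = {node: [] for node in dag}
    for parent, children in dag.items():
        for child in children:
            inv[child].append(parent)
    return inv

expr = ("sum", ("product", "a", "b"),
        ("quotient", ("product", "a", "b"), "c"))
dag = build_parent_child_dag(expr)
inv = invert(dag)
```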
|
|
|
|
|
|
The parent-child DAG is inverted to create a child-parent DAG, which represents for each sub expression all sub expressions that directly include it. This DAG is needed to add the precomputed sub expressions in a bottom-up fashion later on. Both DAGs are reduced to eliminate nodes whose children are nested in the same iname set as the parent node. This is realized through a breadth-first traversal of the child-parent DAG and creates a new parent-child DAG. It could be modified to only eliminate nodes with a single child in the same iname set, but in practice this approach resulted in a high number of temporaries. My guess is that the C++ compiler is better equipped to decide how CSE should be handled within each loop level, since it probably also considers other factors than the sheer number of appearances, such as the number of in-use registers. The new reduced child-parent DAG is computed by inverting the reduced parent-child DAG.
|
|
|
|
|
|
As an intermediate step, a mapping from each sub expression to a name and an id for its assignment instruction is created. Then a new DAG, mirroring the parent-child DAG, is created, in which each sub expression is replaced by its name and id.
|
|
|
|
|
|
## Build New Kernel
|
|
|
With the transformed parent-child DAG and the mapping from sub expressions to names and ids, the sub expressions can be precomputed. This is implemented as a breadth-first traversal of the inverse of the parent-child DAG. The precomputation results in new instructions, a replacement dict for the corresponding sub expressions, and a dependency DAG capturing the dependencies of each precomputation on both other precomputations and existing instructions. If a sub expression appears in multiple instructions with different predicates, all predicates are combined with `or` for the precomputation instruction. Special care needs to be taken if the precomputed sub expression depends on a vectorized iname. If this is the case, the shape and the dimension tags must be set explicitly according to the vectorization width, since the kernel is already preprocessed.
|
|
|
|
|
|
Next, the new schedule is computed using the dependency DAG from before. The schedule index for each precomputation must satisfy the following conditions:

- it must be scheduled after the maximal index of the existing instructions it depends on,
- it must be scheduled after the precomputations of new CSE temporaries it depends on,
- it must be scheduled before the minimal index of the existing instructions which depend on it,
- and it must be scheduled within the correct inames.

The condition of scheduling before the instructions depending on the new precomputation is easily satisfied by using the minimal schedule index which satisfies the remaining three conditions. The new schedule is created by determining for each precomputation the schedule index satisfying all conditions and then inserting each new schedule item at the corresponding position.
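The index choice for a single precomputation can be sketched as follows; the instruction names and the index semantics (positions in the old schedule, new items inserted at their index) are simplified assumptions:

```python
# Sketch of choosing the schedule index for one precomputation: it must come
# after every existing instruction and every already-placed precomputation
# it depends on, and taking the minimal such index keeps it before the
# instructions that use it.

def schedule_index(deps, existing_pos, placed_pos):
    after = [existing_pos[d] + 1 for d in deps if d in existing_pos]
    after += [placed_pos[d] + 1 for d in deps if d in placed_pos]
    return max(after, default=0)

existing_pos = {"insn_a": 0, "insn_b": 3}  # positions in the old schedule
placed_pos = {"cse_0": 1}                  # already-decided precomputations
print(schedule_index({"insn_a", "cse_0"}, existing_pos, placed_pos))  # 2
```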
|
|
|
|
|
|
To ensure that, for each precomputation, all other new precomputations it depends on are already scheduled, the inverted dependency DAG is traversed in a breadth-first manner, with the modification that a new node is added to the queue only if all of its children have already been handled. The maximal index of the existing instructions is trivially found, but the index of a precomputation is more difficult to determine, since the schedule is only updated after all new schedule indices have been found; precomputations for which the index has already been decided are therefore not yet part of the schedule. To circumvent this problem, the indices are gathered in an insertion-ordered dict, so if the highest dependency index stems from a new precomputation, that index can be used as the new schedule index, since the order is preserved during insertion.
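This modified traversal is effectively a Kahn-style topological ordering of the dependency DAG; a sketch with hypothetical `cse_*` names:

```python
from collections import deque

# Sketch of the modified breadth-first traversal: a node enters the queue
# only once all of its children (the precomputations it depends on) have
# been handled.

def dependency_order(deps):
    """deps: node -> list of nodes it depends on (all present as keys)."""
    pending = {n: len(d) for n, d in deps.items()}
    users = {n: [] for n in deps}
    for n, d in deps.items():
        for child in d:
            users[child].append(n)
    queue = deque(n for n, k in pending.items() if k == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for parent in users[n]:
            pending[parent] -= 1
            if pending[parent] == 0:  # all children handled: ready
                queue.append(parent)
    return order

deps = {"cse_2": ["cse_0", "cse_1"], "cse_1": ["cse_0"], "cse_0": []}
print(dependency_order(deps))  # ['cse_0', 'cse_1', 'cse_2']
```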
|
|
|
|
|
|
The last step is to transform the instructions containing a precomputed sub expression using the replacement dict from before. The replacement mapper used for this differs in one detail from the standard pymbolic replacement mapper: it does not flatten products or sums, since the rewritten expressions from earlier need to be left intact for the replacement dict to work. Combining these instructions with the precomputation instructions, the ignored instructions from the beginning, and the new schedule results in the finished kernel.
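The structure-preserving replacement can be sketched on the tuple representation used above; `cse_0` is a hypothetical temporary name:

```python
# Sketch of the replacement step: sub expressions found in the replacement
# dict are substituted by their temporary's name; otherwise the node is
# rebuilt with replaced children. Unlike a flattening mapper, the nested
# tuple shape of sums/products is reproduced exactly, so the grouped
# expressions from the rewriting step stay matchable.

def replace(expr, replacements):
    if expr in replacements:
        return replacements[expr]
    if isinstance(expr, str):
        return expr
    op, *children = expr
    return (op,) + tuple(replace(c, replacements) for c in children)

grouped = ("sum", "a", ("sum", ("sum", "b", "c"), "d"))
repl = {("sum", "b", "c"): "cse_0"}
print(replace(grouped, repl))  # ('sum', 'a', ('sum', 'cse_0', 'd'))
```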
|
|
|
|
|
|
If, for some reason, the kernel does not have a schedulable loop nesting, the new kernel is discarded and the fallback implementation without code motion is used. This is a remnant of an earlier implementation and should not happen anymore.
|
|
|
|
|
|
# Shortcomings
|
|
|
- Nested sub expressions can only be handled in an all-or-nothing approach: either all nested sub expressions are precomputed if they appear more than once, regardless of the loop nests, or only nested sub expressions which can be hoisted out of the parent loop nest are precomputed. An approach in between would need to estimate somehow how many and which precomputations are reasonable.
- Although this situation never appeared, this approach probably cannot handle one CSE within multiple instructions with different iname sets.
- This can only precompute scalar expressions. In some cases it would be beneficial to precompute multidimensional expressions, e.g. a scaled basis gradient, but this is currently not implemented.