|
|
# Input
|
|
|
The LICM algorithm is implemented as a series of loopy kernel transformations. The input loopy kernel has already been vectorized, and other transformations, such as constant folding (e.g. `2 * 4 -> 8`), have been applied. The kernel is preprocessed and scheduled before the actual algorithm runs; if the input kernel already has a schedule, this is a no-op.
|
|
|
|
|
|
|
|
|
# Algorithm
|
|
|
As a preliminary step, the kernel instructions are split into those that can be modified through CSE, namely `Assignment` and `CallInstruction`, and those that cannot, i.e. everything else, most importantly `CInstruction`s. The latter are ignored during the rest of the algorithm and added back to the resulting kernel unchanged.
|
|
|
|
|
|
The algorithm consists of the following steps:

- determine the minimal iname set for all sub expressions
- use loopy to generate a schedule for the kernel
- deduce the iname nesting from the schedule
- determine the valid scope of temporaries
- rewrite each sub expression to group expressions with the same minimal iname set
- update the minimal iname sets
- compute the cost of each sub expression within its instruction
- build the sub expression graph of CSEs to extract
- build the new kernel
- precompute the CSEs
- add the CSEs to the schedule
- replace the CSEs in the original instructions
|
|
|
|
|
|
## Minimal Iname Sets
|
|
|
The first step is to determine for each sub expression the minimal set of inames it depends on. A sub expression denotes a node in the pymbolic AST. This requires additional information which is not present in the input kernel: first, the nesting of the loop domains, and second, the scope of temporary variables.
|
|
|
|
|
|
The nesting of the loop domains is obtained directly from loopy. It should be noted that the computation of the schedule is extremely slow compared to the rest of the CSE algorithm, but it is a necessary step in the code generation process even if LICM is not applied. As a result, the LICM transformation can take up a significant part (>50%) of the runtime.
|
|
|
|
|
|
If a sub expression contains a temporary variable, its inames must be nested within the valid scope of the temporary. For a given temporary, its valid scope is the iname set under which the variable is initialized, discarding possible indexing inames of the temporary. As an example, consider the most common case, the value and gradient of the trial function, with the following loop nest
|
|
|
|
|
|
```
for q
  for ex, ey
    u = reduction((ix, iy), phi[q, (ix, iy)] * x[ix, iy, ex, ey])
    for d
      grad_u[d] = reduction((ix, iy), phi[q, d, (ix, iy)] * x[ix, iy, ex, ey])
```
|
|
|
|
|
|
the valid scope of both variables is the set `q, ex, ey`. Other examples of temporaries include the macro-element corners, which are initialized completely outside of the quadrature loop; their valid scope is therefore the entire kernel.
|
|
|
|
|
|
With this additional information, the inames a sub expression depends on are determined. This is done in two steps: first the predicates of all instructions are considered, and then, with this information, the expressions of all instructions. Predicates usually appear in the context of boundary kernels to distinguish between different parts of the boundary. If an instruction has a predicate, all sub expressions of its right hand side need to be nested within at least the inames of the predicate. In both steps, the iname sets are collected through a depth-first AST traversal, merging the sets of the child nodes to compute the iname set of the parent node. The iname set for each node is saved in a dict. For the tree traversal, in addition to the usual leaf nodes of a pymbolic AST, function nodes are considered leaf nodes, so that these nodes are only treated as a whole in the following steps. The traversal is implemented as a pymbolic visitor (also called a mapper), inheriting from the `CombineMapper`, which already implements the base functionality.
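The collection step can be sketched in plain Python. Nested tuples stand in for pymbolic nodes and `leaf_inames` for the iname and scope information from the kernel; all names here are illustrative, not the actual implementation:

```python
# Sketch of the bottom-up iname set collection, mimicking a pymbolic
# CombineMapper: the set of a parent node is the union of its children's
# sets, and every node's set is cached in a dict.

def iname_set(expr, leaf_inames, cache):
    """Return the minimal iname set of `expr`, recording it in `cache`."""
    if expr in cache:
        return cache[expr]
    if isinstance(expr, str):  # leaf: variable, temporary or function node
        inames = frozenset(leaf_inames.get(expr, ()))
    else:  # interior node ("sum", child, ...): combine the children's sets
        _op, *children = expr
        inames = frozenset().union(
            *(iname_set(c, leaf_inames, cache) for c in children))
    cache[expr] = inames
    return inames

# hypothetical leaf information for a quadrature-style expression
leaf_inames = {"phi": ("q", "ix"), "x": ("ix", "ex"), "w": ("q",)}
expr = ("sum", ("product", "phi", "x"), "w")
cache = {}
iname_set(expr, leaf_inames, cache)
# cache now holds the minimal iname set of every sub expression
```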
|
|
|
|
|
|
## Rewriting Expressions
|
|
|
In the next part of the algorithm, these iname sets are used to rewrite the expressions of all instructions to mirror the nesting of the iname sets. This allows an easier extraction of nodes later on. The rewriting is again implemented as a depth-first tree traversal, extending existing pymbolic mappers. The approach is to first split the children of a node into groups of nodes with the same iname sets. If the number of groups is `> 1` and `< #children`, the groups are sorted according to the length of their corresponding iname sets. This relies on the assumption that the iname sets are perfectly nested, i.e. the sets can be sorted such that `is_1 <= is_2 <= is_3 ...`. If this assumption does not hold, an exception is raised and the LICM algorithm is aborted; in that case a fallback CSE algorithm without code motion is used. This case usually appears in Jacobian kernels, since the `i, j` loops of the matrix index have no obvious nesting. If the assumption holds, the groups are processed separately and the results are chained together to recreate the nesting of the iname sets. For example, an expression `a + b + c + d` may be refactored as `a + ((b + c) + d)`, meaning that `a` is contained in the smallest (least constrained) iname set, `b, c, d` all have larger iname sets, and `b, c` share the same iname set. Currently this rewriting is only applied to `sum` and `product` nodes. For these new expressions the minimal iname sets have to be recomputed.
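A minimal sketch of this grouping step, again with tuples standing in for pymbolic nodes; `regroup_sum` and `NonNestedError` are illustrative names, not the actual API:

```python
# Sketch of regrouping a sum's children by minimal iname set. Groups are
# sorted by set size, checked for perfect nesting, and chained into a
# right-nested sum from the smallest set outwards.

class NonNestedError(Exception):
    """Raised when the iname sets are not perfectly nested."""

def regroup_sum(children, iname_sets):
    groups = {}
    for c in children:
        groups.setdefault(iname_sets[c], []).append(c)
    if not 1 < len(groups) < len(children):
        return ("sum",) + tuple(children)  # nothing to regroup
    keys = sorted(groups, key=len)
    for small, large in zip(keys, keys[1:]):
        if not small <= large:  # assumption: is_1 <= is_2 <= ...
            raise NonNestedError((small, large))
    def as_node(group):
        return group[0] if len(group) == 1 else ("sum",) + tuple(group)
    node = as_node(groups[keys[-1]])  # group with the largest iname set
    for key in reversed(keys[:-1]):   # wrap towards the smallest set
        node = ("sum", as_node(groups[key]), node)
    return node

# the example from the text: a has the smallest set, b and c share a larger
# set, d has the largest set -> a + ((b + c) + d)
sets = {"a": frozenset({"q"}),
        "b": frozenset({"q", "ex"}), "c": frozenset({"q", "ex"}),
        "d": frozenset({"q", "ex", "d"})}
out = regroup_sum(["a", "b", "c", "d"], sets)
# out == ("sum", "a", ("sum", ("sum", "b", "c"), "d"))
```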
|
|
|
|
|
|
With the rewritten expressions, the cost of each sub expression is computed with respect to the instruction it is contained in. The cost is defined as the size of the iname set of the instruction without the minimal iname set of the sub expression, i.e. the size of the superfluous iname set. The actual size of that set is computed using isl. If the iname set contains parameterized inames (inames without compile-time bounds), the parameter of such an iname is fixed to two, which may not be reasonable in the case of DG0 basis functions. As before, this is implemented as a tree traversal, and in addition to the cost of each sub expression, the id of the instruction containing the sub expression is cached.
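The cost computation can be illustrated without isl: a plain dict of compile-time bounds stands in for the isl point count, with parameterized inames (bound `None`) fixed to two as described above. The bounds and names are hypothetical:

```python
# Sketch of the cost computation: the superfluous inames are those of the
# instruction minus those of the sub expression, and the cost is the number
# of iterations of that set. The real implementation counts with isl.

def cost(insn_inames, expr_inames, bounds):
    size = 1
    for iname in insn_inames - expr_inames:
        size *= 2 if bounds[iname] is None else bounds[iname]
    return size

bounds = {"q": 4, "ex": None, "ey": None, "d": 3}  # hypothetical bounds
# a sub expression depending only on q, inside a (q, ex, ey, d) instruction:
print(cost({"q", "ex", "ey", "d"}, {"q"}, bounds))  # 2 * 2 * 3 = 12
```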
|
|
|
|
|
|
## Building Sub Expression DAGs
|
|
|
The set of all sub expressions with cost `> 1` is preprocessed to build a dependency graph between sub expressions, i.e. recording for each sub expression which sub expressions it contains and vice versa, and to filter out unnecessary sub expressions. The graphs used for this purpose are realized through dicts with nodes as keys and edges as values. The starting point is a directed acyclic graph representing the parent-child relation of the sub expressions, i.e. the children of a node are the child sub expressions of the parent sub expression. The graph is built through a tree traversal, similar to the earlier cases. During the traversal, sub expressions which can appear outside any other sub expression are cached as top level sub expressions. It should be noted that a top level sub expression may still appear within other sub expressions, so they are not simply the roots of the graph. Only certain pymbolic expressions may appear as nodes in this graph; currently `sum`, `product`, `quotient`, and `call` are considered. Other pymbolic expressions which also contain some notion of child expressions may be added if they are needed.
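A sketch of the dict-based DAG, with tuples standing in for pymbolic nodes (the names are illustrative); `invert` produces the child-parent DAG described in the next paragraph:

```python
# Sketch of the parent-child DAG: keys are sub expressions (tuples), values
# are the CSE-eligible sub expressions appearing directly inside them.

CSE_NODES = {"sum", "product", "quotient", "call"}

def build_parent_child_dag(expr, dag=None):
    if dag is None:
        dag = {}
    if isinstance(expr, str) or expr in dag:
        return dag
    _op, *children = expr
    if _op in CSE_NODES:
        dag[expr] = [c for c in children
                     if not isinstance(c, str) and c[0] in CSE_NODES]
    for c in children:
        build_parent_child_dag(c, dag)
    return dag

def invert(dag):
    """Create the child-parent DAG from the parent-child DAG."""
    inv = {node: [] for node in dag}
    for parent, children in dag.items():
        for child in children:
            inv[child].append(parent)
    return inv

expr = ("sum", ("product", "a", "b"),
        ("quotient", ("product", "a", "b"), "c"))
dag = build_parent_child_dag(expr)
inv = invert(dag)
```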
|
|
|
|
|
|
The parent-child DAG is inverted to create a child-parent DAG, which represents for each sub expression all sub expressions that directly include it. This DAG is needed to add the precomputed sub expressions in a bottom-up fashion later on. Both DAGs are reduced to eliminate nodes whose children are nested in the same iname set as the parent node. This is realized through a breadth-first traversal of the child-parent DAG and creates a new parent-child DAG. It could be modified to only eliminate nodes with a single child in the same iname set, but in practice this approach resulted in a high number of temporaries. My guess is that the C++ compiler is better equipped to decide how CSE should be handled within each loop level, since it probably also considers other factors than the sheer number of appearances, such as the number of in-use registers. The new reduced child-parent DAG is computed by inverting the reduced parent-child DAG.
|
|
|
|
|
|
As an intermediate step, a mapping from each sub expression to a name and an id for its assignment instruction is created. Then a new DAG, mirroring the parent-child DAG, is created, in which each sub expression is replaced by its name and id.
|
|
|
|
|
|
## Build New Kernel
|
|
|
With the transformed parent-child DAG and the mapping from sub expressions to names and ids, the sub expressions can be precomputed. This is implemented as a breadth-first traversal of the inverse of the parent-child DAG. The precomputation results in new instructions, a replacement dict for the corresponding sub expressions, and a dependency DAG capturing the dependencies of each precomputation on both other precomputations and existing instructions. If a sub expression appears in multiple instructions with different predicates, all predicates are combined with `or` for the precomputation instruction. Special care needs to be taken if the precomputed sub expression depends on a vectorized iname. If this is the case, the shape and the dimension tags must be set explicitly according to the vectorization width, since the kernel is already preprocessed.
|
|
|
|
|
|
Next, the new schedule is computed using the dependency DAG from before. The schedule index for each precomputation must satisfy the following conditions:

- it must be scheduled after the maximal index of the existing instructions it depends on,
- it must be scheduled after the precomputations of new CSE temporaries it depends on,
- it must be scheduled before the minimal index of the existing instructions which depend on it,
- and it must be scheduled within the correct inames.

The condition of scheduling before the instructions depending on the new precomputation is easily satisfied by using the minimal schedule index which satisfies the remaining three conditions. The new schedule is created by determining for each precomputation the schedule index satisfying all conditions and then inserting each new schedule item at the corresponding position.
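The index choice for a single precomputation can be sketched as follows; the instruction names and the index semantics (positions in the old schedule, new items inserted at their index) are simplified assumptions:

```python
# Sketch of choosing the schedule index for one precomputation: it must come
# after every existing instruction and every already-placed precomputation
# it depends on, and taking the minimal such index keeps it before the
# instructions that use it.

def schedule_index(deps, existing_pos, placed_pos):
    after = [existing_pos[d] + 1 for d in deps if d in existing_pos]
    after += [placed_pos[d] + 1 for d in deps if d in placed_pos]
    return max(after, default=0)

existing_pos = {"insn_a": 0, "insn_b": 3}  # positions in the old schedule
placed_pos = {"cse_0": 1}                  # already-decided precomputations
print(schedule_index({"insn_a", "cse_0"}, existing_pos, placed_pos))  # 2
```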
|
|
|
|
|
|
To ensure that, for each precomputation, all other new precomputations it depends on are already scheduled, the inverted dependency DAG is traversed in a breadth-first manner, with the modification that a new node is added to the queue only if all of its children have already been handled. The maximal index of the existing instructions is trivially found, but the index of a precomputation is more difficult to determine, since the schedule is only updated after all new schedule indices have been found; precomputations for which the index has already been decided are therefore not yet part of the schedule. To circumvent this problem, the indices are gathered in an insertion-ordered dict, so if the highest dependency index stems from a new precomputation, that index can be used as the new schedule index, since the order is preserved during insertion.
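This modified traversal is effectively a Kahn-style topological ordering of the dependency DAG; a sketch with hypothetical `cse_*` names:

```python
from collections import deque

# Sketch of the modified breadth-first traversal: a node enters the queue
# only once all of its children (the precomputations it depends on) have
# been handled.

def dependency_order(deps):
    """deps: node -> list of nodes it depends on (all present as keys)."""
    pending = {n: len(d) for n, d in deps.items()}
    users = {n: [] for n in deps}
    for n, d in deps.items():
        for child in d:
            users[child].append(n)
    queue = deque(n for n, k in pending.items() if k == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for parent in users[n]:
            pending[parent] -= 1
            if pending[parent] == 0:  # all children handled: ready
                queue.append(parent)
    return order

deps = {"cse_2": ["cse_0", "cse_1"], "cse_1": ["cse_0"], "cse_0": []}
print(dependency_order(deps))  # ['cse_0', 'cse_1', 'cse_2']
```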
|
|
|
|
|
|
The last step is to transform the instructions containing a precomputed sub expression using the replacement dict from before. The replacement mapper used for this differs in one detail from the standard pymbolic replacement mapper: it does not flatten products or sums, since the rewritten expressions from earlier need to be left intact for the replacement dict to work. Combining these instructions with the precomputation instructions, the ignored instructions from the beginning, and the new schedule results in the finished kernel.
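The structure-preserving replacement can be sketched on the tuple representation used above; `cse_0` is a hypothetical temporary name:

```python
# Sketch of the replacement step: sub expressions found in the replacement
# dict are substituted by their temporary's name; otherwise the node is
# rebuilt with replaced children. Unlike a flattening mapper, the nested
# tuple shape of sums/products is reproduced exactly, so the grouped
# expressions from the rewriting step stay matchable.

def replace(expr, replacements):
    if expr in replacements:
        return replacements[expr]
    if isinstance(expr, str):
        return expr
    op, *children = expr
    return (op,) + tuple(replace(c, replacements) for c in children)

grouped = ("sum", "a", ("sum", ("sum", "b", "c"), "d"))
repl = {("sum", "b", "c"): "cse_0"}
print(replace(grouped, repl))  # ('sum', 'a', ('sum', 'cse_0', 'd'))
```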
|
|
|
|
|
|
If, for some reason, the kernel does not have a schedulable loop nesting, the new kernel is discarded and the fallback implementation without code motion is used. This is a remnant of an earlier implementation and should not happen anymore.
|
|
|
|
|
|
# Shortcomings
|
|
|
- Nested sub expressions can only be handled in an all-or-nothing approach: either all nested sub expressions are precomputed if they appear more than once, regardless of the loop nests, or only nested sub expressions which can be hoisted out of the parent loop nest are precomputed. An approach in between would need to estimate somehow how many and which precomputations are reasonable.
- Although this situation never appeared, this approach probably cannot handle one CSE within multiple instructions with different iname sets.
- This can only precompute scalar expressions. In some cases it would be beneficial to precompute multidimensional expressions, e.g. a scaled basis gradient, but this is currently not implemented.