Profiling of sum-factorized code generation

Code generation already takes an excessive amount of time because of the sheer number of kernels that have to be generated. This will get much worse in the future on unstructured grids. Statistical profiling could show where most of the time is spent and whether some results could be cached between kernels.
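
As a starting point, here is a minimal sketch of how such a measurement could look, assuming the generator is driven from Python. The names `generate_kernel`, `expensive_subresult` and `profile_generation` are hypothetical stand-ins and not part of the actual code base. Note that `cProfile` from the standard library is a deterministic profiler, not a statistical one; it is used here only because it needs no extra dependencies, and a sampling profiler could be swapped in for lower overhead. The `functools.lru_cache` decorator illustrates what caching a result between kernels might look like.

```python
import cProfile
import functools
import io
import pstats


@functools.lru_cache(maxsize=None)
def expensive_subresult(degree):
    # Hypothetical stand-in for an intermediate result that several
    # kernels share; the cache returns it without recomputation.
    return sum(i * i for i in range(degree * 10000))


def generate_kernel(name, degree):
    # Hypothetical stand-in for generating a single kernel.
    return "{}: {}".format(name, expensive_subresult(degree))


def profile_generation(kernels):
    # Profile the generation of all kernels and return the generated
    # results together with a report sorted by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    results = [generate_kernel(name, degree) for name, degree in kernels]
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return results, stream.getvalue()


if __name__ == "__main__":
    kernels = [("volume", 3), ("skeleton", 3), ("boundary", 3)]
    results, report = profile_generation(kernels)
    print(report)
    # Shows how often the shared result was served from the cache.
    print(expensive_subresult.cache_info())
```

The report lists the functions with the highest cumulative time, and `cache_info()` indicates how many kernels were able to reuse a cached result. Whether the real intermediate results are hashable and independent enough of the kernel to be cached this way is an open question that the profile would have to answer.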