BCRSMatrix Performance improvements

In my latest benchmarks, I found that the matrix-vector operations have some deficiencies under certain conditions and can improved. These are the issues that I have found so far:

The row_type stores redundant information. Each row stores two pointers (begin of data, begin of sparse indices) and the size. Two of them are redundant because a vector of offsets (the out-of-the-book CRS implementation) can be produce all of them on the fly. This is problematic because loading them implies more memory traffic on an already memory bound operation. The solution to this, is to store the vector of offsets and create the rows on the fly. The obstacle is that matrix creation is written in terms of row_type. It's a lot of changes, but is possible.
On nested blocked structures (i.e. BCRSMatrix<BCRSMatrix<Field,...>, ...>), the size of BCRSMatrix<Field,...> is so big (~128 bytes) that loading each sub-block a whole cache line is loaded to only read three pointers (8*3 bytes). The solution to this, is to store the additional information on the heap so that it is out of the hot-path when doing blocking.
Also on nested blocked structures. Because row_type is stored using pointers to the actual data, it is not possible to reuse the row pattern when copying matrices with the same pattern. When reusing the vector of offsets, even less memory loads are required. This can be solved when using a shared pointer to the vector of offsets described above (same as how the column indices are handled now).

Benchmark

For a matrix resulting of a structure 2D grid solving a system of N equations using finite differences. N corresponds to the block size on the following figures, and sparsity corresponds to the sparsity of the "sub-blocks". A matrix bigger than L3 cache was used, and the matrix-vector operation was repeated 20 times. Flat matrices with a low number of entries per row get a benefit out of 1, whereas nested blocked matrices get benefit from 1, 2 and 3.

	Flat	Blocked
Sample Pattern (BlockSize=5, BlockSparcity=0.5)
SpeedUp wrt partial implementation

Edited Jan 27, 2023 by Santiago Ospina De Los Ríos