[bugfix][parallel] Fix partition in rangeSegment() function
This function should distribute a range of length size
into n
segments of approximately the same size.
The current version uses segments of length floor(size/n)
.
The remaining r = size % n
entries are added to the last segment.
For small segment sizes and large modulus r = size % n
this may lead to extremely bad partitions.
E.g. for size=17
and n=9
we get floor(size/9) = 1
such
that we have 8 segements of size 1 and one of size 9.
In contrast, if we use n=8
we get 8 segments of size 2.
I.e when going from n=8
to n=9
threads, the runtime
will be about 4 times as long.
This patch changes the partition to use r
segments of size
ceil(size/n)
and n-r
segments of size floor(size/n)
which is as balanced as possible.