Patch rebased to current master is attached. I'm going to improve my testing script and post new results.
New benchmarking script and results are attached. There new dataset parameter is introduced: skew factor. Skew factor defines skew in distribution of groups sizes.
My idea of generating is just usage of power function where power is from 0 to 1. Following formula is used to get group number for particular item number i.
For example, power = 1/6 gives following distribution of groups sizes:
group number group size
0 2
163
2665
33367
411529
531031
670993
7144495
8269297
9468558
For convenience, instead of power itself, I use skew factor where power = 1.0 / (1.0 + skew). Therefore, with skew = 0.0, distribution of groups sizes is uniform. Larger skew gives more skewed distribution (and that seems to be quite intuitive). For, negative skew, group sizes are mirrored as for corresponding positive skew. For example, skew factor = -5.0 gives following groups sizes distribution:
group number group size
0468558
1269297
2144495
370993
431031
511529
63367
7665
863
92
Results shows that between 2172 test cases, in 2113 incremental sort gives speedup while in 59 it causes slowdown. The following 4 test cases show most significant slowdown (>10% of time).
As you can see, 3 of this 4 test cases have skewed distribution while one of them is related to costly location-aware comparison of text. I've no particular idea of how to cope these slowdowns. Probably, it's OK to have slowdown in some cases while have speedup in majority of cases (assuming there is an option to turn off new behavior). Probably, we should teach optimizer more about skewed distributions of groups, but that doesn't seem feasible for me.