On 09/24/2014 10:45 AM, Fabien COELHO wrote:
> Currently these distributions are achieved by mapping a continuous
> function onto integers, so that neighboring integers get neighboring
> numbers of draws, say with size=7:
>
> #draws 10 6 3 1 0 0 0 // some exponential distribution
> int drawn 0 1 2 3 4 5 6
>
> Although having an exponential distribution of accesses on tuples is quite
> reasonable, the likelihood that there would be so much correlation between
> neighboring values is not realistic at all. You need some additional
> shuffling to get there.
>
>> I don't understand what that pseudo-random stage you're talking about is. Can
>> you elaborate?
>
> The pseudo random stage is just a way to scatter the values. A basic
> approach to achieve this is "i' = (i * large-prime) % size", if you have a
> modulo. For instance with prime=5 you may get something like:
>
> #draws 10 6 3 1 0 0 0
> int drawn 0 1 2 3 4 5 6 (i)
> scattered 0 5 3 1 6 4 2 (i' = 5 i % 7)
>
> So the distribution becomes:
>
> #draws 10 1 0 3 0 6 0
> scattered 0 1 2 3 4 5 6
>
> Which is more interesting from a testing perspective because it removes
> the neighboring value correlation.
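To make the quoted example concrete, here is a minimal Python sketch of the multiply-and-modulo scatter. The function name `scatter` is only illustrative and is not part of pgbench. The one assumption worth spelling out is that the multiplier must be coprime with the size, otherwise the mapping is not a permutation and some values collide.

```python
from math import gcd


def scatter(i, multiplier, size):
    """Map i in [0, size) to (i * multiplier) % size.

    Illustrative helper only; the name and signature are assumptions.
    """
    # The mapping is a permutation only when multiplier and size are coprime.
    if gcd(multiplier, size) != 1:
        raise ValueError("multiplier must be coprime with size")
    return (i * multiplier) % size


# Draw counts from the quoted example, indexed by the integer drawn.
draws = [10, 6, 3, 1, 0, 0, 0]
size = len(draws)

# Where each drawn integer ends up after scattering with prime=5.
scattered = [scatter(i, 5, size) for i in range(size)]

# Draw counts re-indexed by the scattered value.
result = [0] * size
for i, count in enumerate(draws):
    result[scattered[i]] = count

print(scattered)
print(result)
```

The two printed lists correspond to the "scattered" row and the final "#draws" row in the quoted tables.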

Depends on what you're testing. Yeah, shuffling like that makes sense
for a primary key. Or not: very often, recently inserted rows are also
queried more often, so that there is indeed a strong correlation between
the integer key and the access frequency. Or imagine that you have a
table that stores the height of people in centimeters. To populate that,
you would want to use a Gaussian-distributed variable, without shuffling.

For shuffling, perhaps we should provide a pgbench function or operator
that does that directly, instead of having to implement it using * and
%. Something like hash(x, min, max), where x is the input variable
(Gaussian-distributed, or whatever you want), and min and max are the
range to map it to.
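As a rough illustration of what such a function could look like, here is a minimal Python sketch. The name `hash_scatter`, its signature, the inclusive bounds, and the choice of SHA-256 are all assumptions made for illustration; none of it is existing pgbench functionality. Note that, unlike the multiply-and-modulo trick, a hash is not a permutation, so two different inputs may map to the same output.

```python
import hashlib


def hash_scatter(x, lo, hi):
    """Hypothetical hash(x, min, max): map integer x into [lo, hi].

    The result is deterministic for a given x but otherwise unrelated
    to the values produced for neighboring inputs.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    # Hash the decimal representation of x; any stable hash would do here.
    digest = hashlib.sha256(str(x).encode("ascii")).digest()
    # Use the first 8 bytes as an unsigned integer.
    h = int.from_bytes(digest[:8], "big")
    # Fold into the inclusive range [lo, hi].
    return lo + h % (hi - lo + 1)


# Neighboring inputs are mapped independently of each other.
for x in range(5):
    print(x, hash_scatter(x, 1, 100))
```

The input x could come from any distribution (Gaussian, exponential, ...); the hash step only decides where in the target range each value lands.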

> I must say that I'm appalled by a decision process which leads to such
> results, with significant patches passed, and the tiny complement to make
> it really useful (I mean not on paper or on the feature list, but in
> real life) is rejected...

The idea of a modulo operator was not rejected; we'd just like to have
the infrastructure in place first.

- Heikki