Re: add modulo (%) operator to pgbench

From Heikki Linnakangas
Subject Re: add modulo (%) operator to pgbench
Msg-id 54229CB6.5010608@vmware.com
In reply to Re: add modulo (%) operator to pgbench  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: add modulo (%) operator to pgbench  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers
On 09/24/2014 10:45 AM, Fabien COELHO wrote:
> Currently these distributions are achieved by mapping a continuous
> function onto integers, so that neighboring integers get neighboring
> number of draws, say with size=7:
>
>     #draws     10 6 3 1 0 0 0  // some exponential distribution
>     int drawn   0 1 2 3 4 5 6
>
> Although having an exponential distribution of accesses on tuples is quite
> reasonable, the likelihood that there would be so much correlation between
> neighboring values is not realistic at all. You need some additional
> shuffling to get there.
>
>> I don't understand what that pseudo-random stage you're talking about is. Can
>> you elaborate?
>
> The pseudo random stage is just a way to scatter the values. A basic
> approach to achieve this is "i' = (i * large-prime) % size", if you have a
> modulo. For instance with prime=5 you may get something like:
>
>     #draws     10 6 3 1 0 0 0
>     int drawn   0 1 2 3 4 5 6 (i)
>     scattered   0 5 3 1 6 4 2 (i' = 5 i % 7)
>
> So the distribution becomes:
>
>     #draws     10 1 0 3 0 6 0
>     scattered   0 1 2 3 4 5 6
>
> Which is more interesting from a testing perspective because it removes
> the neighboring value correlation.
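
The scatter step quoted above can be checked with a few lines of code. This is only a sketch of the arithmetic in the example (size=7, prime=5), not pgbench code; the function name scatter is made up for illustration. The mapping is a permutation of 0..size-1 only when the multiplier and size share no common factor.

```python
# Sketch of the quoted scatter step: i' = (i * prime) % size.
# "scatter" is a hypothetical name, not an existing pgbench function.
def scatter(i, prime, size):
    """Map i in [0, size) to a slot in [0, size).

    This is a bijection only when gcd(prime, size) == 1.
    """
    return (i * prime) % size


size, prime = 7, 5
draws = [10, 6, 3, 1, 0, 0, 0]  # draws per original integer, from the example

# Where each original integer lands after scattering.
scattered = [scatter(i, prime, size) for i in range(size)]

# Number of draws seen at each scattered position.
redistributed = [0] * size
for i, n in enumerate(draws):
    redistributed[scattered[i]] += n

print(scattered)      # [0, 5, 3, 1, 6, 4, 2]
print(redistributed)  # [10, 1, 0, 3, 0, 6, 0]
```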

Depends on what you're testing. Yeah, shuffling like that makes sense 
for a primary key. Or not: very often, recently inserted rows are also 
queried more often, so that there is indeed a strong correlation between 
the integer key and the access frequency. Or imagine that you have a 
table that stores the height of people in centimeters. To populate that, 
you would want to use a gaussian distributed variable, without shuffling.

For shuffling, perhaps we should provide a pgbench function or operator 
that does that directly, instead of having to implement it using * and 
%. Something like hash(x, min, max), where x is the input variable 
(gaussian distributed, or whatever you want), and min and max are the 
range to map it to.
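
To make the hash(x, min, max) idea concrete, here is a rough sketch of what such a function might compute. Everything in it is an assumption for illustration: the names hash_to_range and mix64, the inclusive range, and the choice of mixer (a splitmix64-style finalizer). It is not a proposal for the actual implementation.

```python
MASK64 = (1 << 64) - 1


def mix64(x):
    """Deterministic 64-bit integer mixer (splitmix64-style finalizer).

    Assumption: any reasonable integer hash could stand in here.
    """
    x &= MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def hash_to_range(x, lo, hi):
    """Hypothetical hash(x, min, max): map integer x into [lo, hi], inclusive."""
    if lo > hi:
        raise ValueError("empty range")
    return lo + mix64(x) % (hi - lo + 1)


# Neighboring inputs land on unrelated outputs, always within the range.
for x in range(5):
    print(x, hash_to_range(x, 1, 100))
```

Unlike the multiply-and-modulo trick, a hash like this is not a permutation of the range: two inputs can collide and some outputs may never be produced. Whether that matters depends on what is being tested.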

> I must say that I'm appalled by a decision process which leads to such
> results, with significant patches passed, and the tiny complement to make
> it really useful (I mean not on the paper or on the feature list, but in
> real life) is rejected...

The idea of a modulo operator was not rejected; we'd just like to have
the infrastructure in place first.

- Heikki



