Re: Selecting K random rows - efficiently!

Поиск
Список
Период
Сортировка
От John D. Burger
Тема Re: Selecting K random rows - efficiently!
Дата
Msg-id 6134F4E4-6FA3-46E9-A1EE-6DBA015E60EF@mitre.org
обсуждение исходный текст
Ответ на Re: Selecting K random rows - efficiently!  ("Scott Marlowe" <scott.marlowe@gmail.com>)
Ответы Re: Selecting K random rows - efficiently!
Список pgsql-general
As far as I can tell, all of the proposed solutions lack sample
independence.  Take the OP's suggested approach of doing something
like this:

   SELECT * FROM mydata
   WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0)
   ORDER BY mydata.random_number ASC LIMIT 100

All you're doing is picking random =subsequences= from the same
permutation of the original data.  This is not the same as a random
sample.  That is, if rows A and B are adjacent in the permutation,
then if A is in the sample, B will also be in it with very high
probability, depending on the size of the sample.  Another way of
saying this is that the first element of the sample is selected
randomly, the rest are completely deterministic.  In a true random
sample, different elements are selected independently.

On the other hand, ORDER BY RANDOM() does indeed construct true
random samples, because it makes a new permutation every time.  If
you want to use the random_number column approach, then you need to
do the same.  You can accomplish this by sampling from the original
permutation repeatedly, doing the above with LIMIT 1 as many times as
you need.  Yes this is more costly, but TANSTAAFL.

As is often observed, it's easy to create the appearance of
randomness, harder to accomplish in reality.

- John Burger
   MITRE

В списке pgsql-general по дате отправления:

Предыдущее
От: "D. Dante Lorenso"
Дата:
Сообщение: Re: Selecting tree data
Следующее
От: Ow Mun Heng
Дата:
Сообщение: Query_time SQL as a function w/o creating a new type