Re: Selecting K random rows - efficiently!

Поиск
Список
Период
Сортировка
От vincent
Тема Re: Selecting K random rows - efficiently!
Дата
Msg-id 24973.84.107.170.210.1193735283.squirrel@webmail.xs4all.nl
обсуждение исходный текст
Ответ на Selecting K random rows - efficiently!  (cluster <skrald@amossen.dk>)
Ответы Re: Selecting K random rows - efficiently!  ("Pavel Stehule" <pavel.stehule@gmail.com>)
Список pgsql-general
> 2007/10/26, Patrick TJ McPhee <ptjm@interlog.com>:
>> In article <ffnid8$1q2t$1@news.hub.org>, cluster  <skrald@amossen.dk>
>> wrote:
>> % > How important is true randomness?
>> %
>> % The goal is an even distribution but currently I have not seen any way
>> % to produce any kind of random sampling efficiently. Notice the word
>>
>> How about generating the ctid randomly? You can get the number of pages
>> from pg_class and estimate the number of rows either using the number
>> of tuples in pg_class or just based on what you know about the data.
>> Then just generate two series of random numbers, one from 0 to the
>> number
>> of pages and the other from 1 to the number of rows per page, and keep
>> picking rows until you have enough numbers. Assuming there aren't too
>> many dead tuples and your estimates are good, this should retrieve n
>> rows
>> with roughly n look-ups.
>>
>> If your estimates are low, there will be tuples which can never be
>> selected,
>> and so far as I know, there's no way to construct a random ctid in a
>> stock
>> postgres database, but apart from that it seems like a good plan. If
>> efficiency is important, you could create a C function which returns a
>> series of random tids and join on that.
>> --
>>
>
> SELECT id, ...
>    FROM data
>   WHERE id = ANY(ARRAY(
>                        SELECT (random()*max_id)::int
>                           FROM generate_series(1,20)))
>   LIMIT 1;
>
> -- max_id is external constant
>
> Pavel Stehule

That selects records where the id is one of twenty random numbers between
zero and the maximum id. That's not truely random, nor is it completely
safe if there are lots of gaps in the values of id. For example, if the
lowest id  is 50000 and the highest is 50040, this will be very likely to
generate lots of numbers below 50000 and find no records at all.


В списке pgsql-general по дате отправления:

Предыдущее
От: "Pavel Stehule"
Дата:
Сообщение: Re: Selecting K random rows - efficiently!
Следующее
От: Alexis Beuraud
Дата:
Сообщение: Checking empty array