Re: Gsoc2012 idea, tablesample

From Greg Stark
Subject Re: Gsoc2012 idea, tablesample
Date
Msg-id CAM-w4HObqcAgM2n=cqym5LtY9oXndOzkQtZJ+11NZbYxKSLEFw@mail.gmail.com
In response to Re: Gsoc2012 idea, tablesample  (Christopher Browne <cbbrowne@gmail.com>)
Responses Re: Gsoc2012 idea, tablesample  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> I get the feeling that this is a somewhat-magical feature (in that
> users haven't much hope of understanding in what ways the results are
> deterministic) that is sufficiently "magical" that anyone serious
> about their result sets is likely to be unhappy to use either SYSTEM
> or BERNOULLI.

These both sound pretty useful. "BERNOULLI" is fine for cases where
you aren't worried about time dependency in your data, for example if
you're looking for the average or total value of some column.
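
To make the BERNOULLI case concrete, here is a minimal Python sketch,
assuming the method keeps each row independently with a fixed
probability. The function names and the generated column are
hypothetical and purely illustrative; this is not how any particular
database implements it.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def estimate_total(sample, fraction):
    """Scale the sample sum by 1/fraction to estimate the full-table total."""
    return sum(sample) / fraction

def estimate_average(sample):
    """The sample mean is used directly as the estimate of the table mean."""
    return sum(sample) / len(sample) if sample else None

# Hypothetical column of 100,000 values (assumption, not real data).
values = [i % 100 for i in range(100_000)]
sample = bernoulli_sample(values, 0.1, seed=42)

approx_total = estimate_total(sample, 0.1)
approx_avg = estimate_average(sample)
```

Because every row has the same chance of being picked regardless of
where it sits in the table, the average and the scaled total should
land close to the true values, which is why this method suits simple
aggregates.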

SYSTEM just means "I'm willing to trade some unspecified amount of
accuracy for some unspecified amount of speed", which presumably is
only good if you trust the database designers to make a reasonable
trade-off, in cases where speed matters and the accuracy requirements
aren't very strict.
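
One plausible way to get that trade-off is to sample whole pages
instead of individual rows. The sketch below assumes exactly that;
page-level sampling, the `rows_per_page` parameter and the clustered
test table are all assumptions made for illustration, since the method
itself is left unspecified.

```python
import random

def system_sample(rows, fraction, rows_per_page=100, seed=None):
    """Keep or skip whole pages; every row on a chosen page is returned."""
    rng = random.Random(seed)
    sample = []
    for start in range(0, len(rows), rows_per_page):
        # One random draw per page rather than one per row.
        if rng.random() < fraction:
            sample.extend(rows[start:start + rows_per_page])
    return sample

# Hypothetical table whose values are clustered by page:
# all 100 rows on page N hold the value N.
clustered = [page for page in range(1_000) for _ in range(100)]
sample = system_sample(clustered, 0.1, seed=1)

pages_read = len(sample) // 100
distinct_values = len(set(sample))
```

Only the chosen pages need to be read, which is where the speed comes
from. The cost shows up when values are correlated with physical
position: here the sample holds many rows but only `pages_read`
distinct values, so it carries far less information than a row-level
sample of the same size.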

> Possibly the forms of sampling that people *actually* need, most of
> the time, are more like Dollar Unit Sampling, which are pretty
> deterministic, in ways that mandate that they be rather expensive
> (e.g. - guaranteeing Seq Scan).

I don't know about that, but the cases where I would expect to need
other distributions would be ones where you're looking at the tuples in
a non-linear way. Things like "what's the average gap between events" or
"what's the average number of instances per value". These might
require a full table scan but might still be useful if the data is
going to be subsequently aggregated or joined in ways that would be
too expensive on the full data set.
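
As a rough illustration of why "average gap between events" doesn't
survive uniform row sampling, here is a small Python sketch. The event
stream (one event per time unit) and the 10% fraction are made up for
the example.

```python
import random

def average_gap(timestamps):
    """Mean difference between consecutive timestamps (assumed sorted)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical event stream: one event per time unit.
events = list(range(100_000))

# Row-level sample keeping each event with probability 0.1.
fraction = 0.1
rng = random.Random(7)
sampled = [t for t in events if rng.random() < fraction]

true_gap = average_gap(events)     # gap measured on the full data
naive_gap = average_gap(sampled)   # gap measured on the sample alone
```

Dropping rows stretches the distance between the survivors, so the gap
computed on the sample comes out at roughly 1/fraction times the true
gap. A query like this needs either the full data or a sampling method
designed around the statistic being computed.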

But we shouldn't let the best be the enemy of the good here. SYSTEM
and BERNOULLI would cover most use cases, and having them in place
would make it easier to add more methods later.

-- 
greg

