Re: Yet another abort-early plan disaster on 9.3

From: Josh Berkus
Subject: Re: Yet another abort-early plan disaster on 9.3
Date:
Msg-id: 543EACD2.50004@agliodbs.com
In reply to: Yet another abort-early plan disaster on 9.3  (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Yet another abort-early plan disaster on 9.3
List: pgsql-performance
On 10/10/2014 04:16 AM, Greg Stark wrote:
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one.  While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

So, right now our estimation is off on large tables by -10X to -10000X.
 First, the fact that it's *always* low is an indication we're using the
wrong algorithm.  Second, we can most certainly do better than a median
of -1000X.
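To illustrate why a small fixed-size row sample goes so wrong, here's a
rough Python sketch (not our actual ANALYZE code; the table shape and
sample size are made up) of a Haas-Stokes-style estimator of the sort
ANALYZE applies, run over a skewed synthetic column:

import random
from collections import Counter

N = 10_000_000        # table rows
SAMPLE = 30_000       # roughly default_statistics_target = 100

# Skewed column: half the rows share one value, the rest are values
# that each appear exactly twice.
def row_value(i):
    return 0 if i < N // 2 else (i - N // 2) // 2 + 1

true_ndistinct = N // 4 + 1

# Uniform row sample (with replacement, for simplicity).
sample = Counter(row_value(random.randrange(N)) for _ in range(SAMPLE))
d = len(sample)                                   # distinct in sample
f1 = sum(1 for c in sample.values() if c == 1)    # values seen exactly once

# Haas-Stokes "Duj1" style estimate.
estimate = SAMPLE * d / (SAMPLE - f1 + f1 * SAMPLE / N)

print(f"true: {true_ndistinct:,}  estimated: {estimate:,.0f}  "
      f"low by {true_ndistinct / estimate:.0f}x")

On data shaped like that, the estimate lands well over an order of
magnitude below the true count, and it only gets worse as the table
grows while the sample stays fixed.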

One interesting set of algorithms is block-based sampling.  That is, you
read 5% of the physical table in random blocks, reading every row in the
block.  The block size is determined by your storage block size, so
you're not actually reading any more physically than you are logically;
it really is just 5% of the table, especially on SSD.
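In code form the sampling step is trivial; something like this sketch,
where fetch_block_rows() is a hypothetical stand-in for however you
actually read the rows off a heap page:

import random

def block_sample(total_blocks, fetch_block_rows, fraction=0.05):
    """Yield every row from a random `fraction` of the table's blocks."""
    n_blocks = max(1, int(total_blocks * fraction))
    for blkno in random.sample(range(total_blocks), n_blocks):
        # Read the whole block: the extra rows come "for free" once
        # we've paid for the physical block read.
        yield from fetch_block_rows(blkno)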

Then you apply algorithms which first estimate how correlated the common
values within each block are (i.e. how likely is it that the table is
completely sorted?), and then estimate how many values there might be in
total, based on that correlation estimate.
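As a hand-wavy illustration (this is not the published algorithm, just the
shape of the idea), using the blocks gathered by the sketch above:

from collections import Counter

def estimate_ndistinct(sampled_blocks, sample_fraction):
    # sampled_blocks: one list of column values per sampled block
    seen = Counter()
    within_block_repeats = 0
    total_rows = 0
    for block in sampled_blocks:
        c = Counter(block)
        within_block_repeats += sum(n - 1 for n in c.values())
        total_rows += len(block)
        seen.update(c)

    d_sample = len(seen)
    repeats = total_rows - d_sample
    # Crude correlation signal: 1.0 means repeats of a value land in the
    # same block (a sorted table), 0.0 means they're scattered around.
    clustering = within_block_repeats / repeats if repeats else 1.0

    # Sorted table: each new block is mostly new values, so scale the
    # sampled distinct count up by the sampled fraction.
    sorted_guess = d_sample / sample_fraction
    # Scattered values: the block sample has already met most of the
    # values, so the sampled count is close to the answer.
    scattered_guess = d_sample

    return clustering * sorted_guess + (1 - clustering) * scattered_guess

The real papers are a lot more careful about how they extrapolate, but
that's the two-step structure: measure how clustered the column is, then
decide how aggressively to scale up the sample's distinct count.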

I no longer have my ACM membership, so I can't link this, but
researchers were able to get +/- 3X accuracy for a TPC-H workload using
this approach.  A real database would be more variable, of course, but
even so we should be able to achieve +/- 50X, which would be an order of
magnitude better than we're doing now.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

