Re: Multidimensional Histograms

From: Tomas Vondra
Subject: Re: Multidimensional Histograms
Date:
Msg-id: d091b383-8589-4970-83f4-93169a525df4@enterprisedb.com
In reply to: Re: Multidimensional Histograms  (Alexander Cheshev <alex.cheshev@gmail.com>)
Responses: Re: Multidimensional Histograms  (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
           Re: Multidimensional Histograms  (Alexander Cheshev <alex.cheshev@gmail.com>)
List: pgsql-hackers

On 1/6/24 01:00, Alexander Cheshev wrote:
> Hi Tomas,
> 
>> Another reason was that the algorithm described in the two papers you
>> reference (the 1988 paper by DeWitt and the evaluation by Carlson and
>> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
>> processes the columns one by one - first builds buckets on column "a",
>> then splits each bucket into buckets on "b". So it's not symmetrical,
>> and it results in lower accuracy compared to an "ideal" histogram with
>> the same number of buckets (especially for the dimensions split early).
> 
> As stated in [1], the Sum of Squared Errors (SSE) is one of the most
> natural error metrics. An Equi-Depth Histogram is not ideal because it
> doesn't minimize the SSE. On the other hand a V-Optimal Histogram does
> minimize the SSE and from this point of view can be considered an
> ideal solution.
> 
>> This does indeed produce an equi-depth histogram, which seems nice, but
>> the mesh is regular in such a way that any value of the first dimension
>> intersects all buckets on the second dimension. So for example with a
>> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
>> 100 buckets on "b", each representing 1/10000 of the tuples. But we
>> don't know how the tuples are distributed within the bucket - so we end
>> up using 0.5 of the bucket as an unbiased estimate, but that can easily
>> add up in the wrong dimension.
> 
> Suppose that we have a query box a=X and we know how the data is
> distributed within buckets. Let's consider only the buckets which are
> intersected by the query box a=X. As we know how the data is
> distributed within buckets, we can exclude the buckets which have no
> tuples intersecting the query box a=X.
> 
> As we store buckets with no information about the data distribution we
> have to make reasonable assumptions. If we keep buckets relatively
> small then we can assume that data is uniformly distributed within
> them.
> 
> What I am trying to say is that the problem which you pointed out
> comes from the fact that we store buckets with no information about
> the data distribution. Even in the one-dimensional case we have to
> assume how data is distributed within buckets. By the way, Postgres
> assumes a uniform distribution within buckets.
> 

It's not just what Postgres assumes, the assumption of bucket uniformity
is somewhat inherent to the whole concept of a histogram. Yes, maybe we
could keep some "distribution" info about each bucket, but then maybe we
could simply build a histogram with more bins occupying the same space?

What I was proposing is that it should be possible to build histograms
with bins adapted to the density in the given region, with smaller
buckets in areas of high density, so that queries intersect fewer
buckets in the low-density parts of the histogram.
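
To illustrate the idea, here is a minimal Python sketch (the names and
the threshold are made up, it's not meant as an actual implementation):
recursively split any region holding too many tuples at the median of
its widest dimension, so dense regions end up covered by many small
buckets and sparse regions by a few large ones.

import numpy as np

def adaptive_buckets(points, lo, hi, max_tuples=100):
    """Return a list of (lo, hi, count) buckets covering [lo, hi]."""
    n = len(points)
    if n <= max_tuples:
        return [(lo, hi, n)]
    dim = int(np.argmax(hi - lo))           # split the widest dimension
    median = np.median(points[:, dim])
    left = points[points[:, dim] <= median]
    right = points[points[:, dim] > median]
    if len(left) == 0 or len(right) == 0:   # all duplicates; give up
        return [(lo, hi, n)]
    hi_l, lo_r = hi.copy(), lo.copy()
    hi_l[dim] = median
    lo_r[dim] = median
    return (adaptive_buckets(left, lo, hi_l, max_tuples)
            + adaptive_buckets(right, lo_r, hi, max_tuples))

On clustered data this produces exactly the layout I mean - many tiny
buckets around the clusters, a handful of huge ones elsewhere.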

>> However, this is not the only way to build an equi-depth histogram,
>> there are ways to make it more symmetrical. Also, it's not clear
>> equi-depth histograms are ideal with multiple columns. There are papers
>> proposing various other types of histograms (using different criteria
>> to build buckets, optimizing different metrics). The most interesting
>> ones seem to be V-Optimal histograms - see for example papers [1], [2],
>> [3], [4], [5] and [6]. I'm sure there are more. The drawback of course is
>> that it's more expensive to build such histograms.
> 
> Tomas, thank you for sharing your ideas regarding V-Optimal
> Histograms with me. I read through the papers which you suggested and
> came to the following conclusion.
> 
> The problem can be formulated like this. We have N tuples in
> M-dimensional space. We need to partition the space into buckets
> iteratively until the SSE is less than E or we reach the bucket
> limit B.
> 

Yes. Although v-optimal histograms minimize the variance of frequencies.
Not sure if that's what you mean by SSE.
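
For the record, here is the metric as I understand it (a sketch, in
case we mean different things by SSE): the sum of squared deviations of
the per-value frequencies from their bucket average, summed over
buckets - which is just the within-bucket variance of frequencies,
scaled by the bucket size.

def bucket_sse(freqs):
    avg = sum(freqs) / len(freqs)
    return sum((f - avg) ** 2 for f in freqs)

def histogram_sse(buckets):
    # buckets: list of per-value frequency lists, one per bucket
    return sum(bucket_sse(b) for b in buckets)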

> In the case of M-dimensional space it seems to me like an NP-hard
> problem. A greedy heuristic MHIST-2 is proposed in [2]. First we
> sort the N tuples in ascending order. Then we iteratively select the
> bucket which leads to the largest SSE reduction and split it into two
> parts. We repeat the process until the SSE is less than E or we reach
> the bucket limit B.
> 

I don't recall all the details of the MHIST-2 algorithm, but this sounds
about right. Yes, building the optimal histogram would be NP-hard, so
we'd have to use some approximate / greedy algorithm.
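
The shape of the greedy loop would be something like this (a rough
sketch of the idea, not the exact MHIST-2 from the paper - one
dimension only, frequencies already in value order):

def sse(freqs):
    avg = sum(freqs) / len(freqs)
    return sum((f - avg) ** 2 for f in freqs)

def best_split(freqs):
    """Best place to cut one bucket in two, by SSE reduction."""
    best_gain, best_i = 0.0, None
    for i in range(1, len(freqs)):
        gain = sse(freqs) - sse(freqs[:i]) - sse(freqs[i:])
        if gain > best_gain:
            best_gain, best_i = gain, i
    return best_gain, best_i

def greedy_histogram(freqs, B):
    buckets = [list(freqs)]
    while len(buckets) < B:
        best = (0.0, None, None)            # (gain, bucket, position)
        for j, b in enumerate(buckets):
            gain, i = best_split(b)
            if i is not None and gain > best[0]:
                best = (gain, j, i)
        if best[1] is None:                 # nothing left worth splitting
            break
        b = buckets.pop(best[1])
        buckets += [b[:best[2]], b[best[2]:]]
    return buckets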

> If we assume that B is significantly less than N then the time
> complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> equals 3, B equals 1000 and N equals 300*B; then it will take roughly
> 0.9*10^9 iterations to build a V-Optimal Histogram.
> 
> You can see that we have to keep B as low as possible in order to
> build a V-Optimal Histogram in feasible time. And here is a paradox.
> On the one hand we use a V-Optimal Histogram in order to minimize SSE,
> but on the other hand we have to keep B as low as possible, which
> eventually leads to an increase in SSE.
> 

I don't recall the details of the MHIST-2 scheme, but it's true that
calculating a "perfect" V-optimal histogram would be quadratic, O(N^2*B).

But that's exactly why greedy/approximate algorithms exist. Yes, it's
not going to produce the optimal V-optimal histogram, but so what?

> On the other hand the time complexity required to build an Equi-Depth
> Histogram doesn't depend on B and can be estimated as O(M*N*logN). The
> SSE can be arbitrarily reduced by increasing B, which in turn is
> limited only by the storage budget. Experimental results show a low
> error metric [3].
> 

And how does this compare to the approximate/greedy algorithms, both in
terms of construction time and accuracy?

The [3] paper does not compare those, ofc, and I'm somewhat skeptical
about the results in that paper. Not that they'd be wrong, but AFAICS
the assumptions are quite simplistic and well-suited for that particular
type of histogram.

It's far from clear how it would perform for less "smooth" data,
non-numeric data, etc.
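
FWIW the equi-depth construction itself is trivial to prototype, which
is also why I'd like to see it compared to the greedy algorithms on
equal footing. A sketch (sort on one dimension, cut into equal-count
slabs, recurse on the next dimension; k splits per dimension gives k^M
buckets):

import numpy as np

def equi_depth(points, k, dim=0):
    """Split an (N x M) array into k equal-count slabs on `dim`,
    then recurse on the next dimension: k**M buckets in total."""
    if dim == points.shape[1]:
        return [points]
    order = np.argsort(points[:, dim])
    slabs = np.array_split(points[order], k)
    return [b for s in slabs for b in equi_depth(s, k, dim + 1)]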

> In an Equi-Depth Histogram a bucket is represented by two vectors. The
> first vector points to the bottom-left corner of the bucket and the
> other one points to the top-right corner. Thus the space complexity of
> an Equi-Depth Histogram can be estimated as 2*integer_size*M*B. Assume
> that M equals 3, B equals 1000 and integer_size equals 4 bytes; then
> the Equi-Depth Histogram will occupy 24000 bytes.
> 
> If a bucket is partially intersected by a query box then we assume
> that the data is uniformly distributed inside the bucket. It is a
> reasonable assumption if B is relatively large.
> 
> In order to efficiently find the buckets which intersect a query box
> we can store the Equi-Depth Histogram in an R-tree, as proposed in [3].
> On average it takes O(logB) iterations to find the buckets which
> intersect a query box. As storage requirements are dominated by the
> leaf nodes we can assume that it takes slightly more than
> 2*integer_size*M*B.
> 

But all of this can apply to histograms in general, no? It's not somehow
special to equi-depth histograms, a v-optimal histogram could be stored
as an r-tree etc.
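
Just to make the estimation side concrete, a sketch of how the stored
corners would be used (uniformity assumption included; a linear scan
stands in for the R-tree):

def overlap_fraction(b_lo, b_hi, q_lo, q_hi):
    """Fraction of a bucket's volume covered by the query box."""
    frac = 1.0
    for bl, bh, ql, qh in zip(b_lo, b_hi, q_lo, q_hi):
        inter = min(bh, qh) - max(bl, ql)
        if inter <= 0:
            return 0.0                  # disjoint on this dimension
        frac *= inter / (bh - bl)       # assumes non-degenerate bucket
    return frac

def estimate(buckets, q_lo, q_hi):
    # buckets: list of (lo, hi, count); an R-tree would avoid the scan
    return sum(count * overlap_fraction(lo, hi, q_lo, q_hi)
               for lo, hi, count in buckets)

This works for any bucketing scheme that stores corner vectors and
counts, which is the point - the storage and estimation machinery is
orthogonal to how the buckets were built.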

>> IIRC the patch tried to do something like V-optimal histograms, but
>> using a greedy algorithm and not the exhaustive stuff described in the
>> various papers.
> 
> We should only consider computationally tractable solutions. In the
> one-dimensional case a V-Optimal Histogram is probably a good solution,
> but in the multi-dimensional case I would only consider Equi-Width or
> Equi-Depth Histograms. As stated in multiple papers an Equi-Depth
> Histogram proves to be more accurate than an Equi-Width Histogram. By
> the way, Postgres uses something like an Equi-Width Histogram.
> 

Sure, but AFAIK the greedy/approximate algorithms are not intractable.

And at some point it becomes a tradeoff between accuracy of estimates
and resources to build the histogram. Papers from ~1988 are great, but
maybe not sufficient to make such decisions now.

>> FWIW I did not intend to reject the idea of adding multi-dimensional
>> histograms, but rather to provide some insight into the history of the
>> past attempt, and also point some weaknesses of the algorithm described
>> in the 1988 paper. If you choose to work on this, I can do a review etc.
> 
> Thank you very much Tomas. I am new in the community and I definitely
> didn't expect to have such a warm welcome.
> 
> As I indicated above an Equi-Depth Histogram proves to be more accurate
> than an Equi-Width Histogram and both have the same time and space
> requirements. Postgres uses some sort of Equi-Width Histogram. Suppose
> that:
>  * I will create a patch which will replace the Equi-Width Histogram
> with an Equi-Depth Histogram, but only in the 1-dimensional case.
>  * I will show experimental results which demonstrate an improvement
> in selectivity estimation.
> Then will the patch be accepted by the community?
> 
> If the above patch is accepted by the community then I will proceed
> further with the M-dimensional Equi-Depth Histogram...
> 

I find it very unlikely we want (or even need) to significantly rework
the 1-D histograms we already have. AFAICS the current histograms work
pretty well, and if better accuracy is needed, it's easy/efficient to
just increase the statistics target. I'm sure there are various
estimation issues, but I'm not aware of any that'd be solved simply by
using a different 1-D histogram.

I'm not going to reject that outright - but I think the bar you'd need
to clear to justify such a change is damn high. Pretty much everyone is
using the current histograms, which makes it a very sensitive area.
You'll need to show that it helps in practice, without causing harm
elsewhere ...

It's an interesting area for experiments, no doubt about it. And if you
choose to explore it, that's fine. But it's better to be aware it may
not end with a commit.


For the multi-dimensional case, I propose we first try to experiment
with the various algorithms and figure out what works etc. Maybe
implementing them in Python or something would be easier than in C.
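
E.g. with the sketches from earlier in this mail, the whole experiment
fits in a dozen lines (the data generator and the query box are
placeholders, of course):

import numpy as np

rng = np.random.default_rng(42)
pts = rng.normal(size=(100_000, 2))       # stand-in for clustered data

# build ~1000 buckets with the equi-depth sketch above
buckets = [(s.min(axis=0), s.max(axis=0), len(s))
           for s in equi_depth(pts, k=32)]

q_lo, q_hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])
true = np.sum(np.all((pts >= q_lo) & (pts <= q_hi), axis=1))
est = estimate(buckets, q_lo, q_hi)
print(f"true={true} est={est:.0f} error={abs(est - true) / true:.1%}")

Swap in a different bucketing scheme, rerun on a few data sets and
workloads, and we'd have actual numbers to argue about.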


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


