Re: benchmarking the query planner
От | Nathan Boley |
---|---|
Тема | Re: benchmarking the query planner |
Дата | |
Msg-id | 6fa3b6e20812111735s193b4d70lb2864abf279ede9b@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: benchmarking the query planner (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: benchmarking the query planner
(Tom Lane <tgl@sss.pgh.pa.us>)
|
Список | pgsql-hackers |
Thanks for the response. >> Well, ISTM there is a profound difference. For scalarineqsel we care >> about the total number of values in a bucket. For eqsel we care about >> the total number of *distinct* values in each bucket > > Really? > Well, we would obviously also care about the total number of values in the buckets if we were trying to use the histogram in eqsel. Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? Thats currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual histogram buckets. >> IMHO, the whole idea of increasing mcv's seems a mistake. Why not use >> the limited storage in pg_statistic to try and estimate the >> selectivity for ranges of values rather than a single value? > > MCVs are useful for questions that are not related to ranges of values > --- an example of something highly useful we do with them is to try to > estimate the fraction of a column that satisfies a LIKE or regex > pattern. > Good point. I guess I was responding to the eqsel benchmarks. I should remember that I don't necessarily know all the places that mcv's are used. > In fact, as I was just pointing out to Bruce, we can compute them and > do useful things with them for datatypes that don't have a defined sort > order and so the whole concept of "range" is meaningless. > Another good point. But don't we have bigger stat problems for datatypes without an ordering relation? IIRC, analyze doesn't even fully compute the mcv list, as that would be N^2 in the sample size. > Now, if your point is that it'd be more flexible to not tie the MCV list > length to the histogram length, you're right. No, my point is just the opposite. I think the two should be *more* tightly linked. It seems that ( at least for eqsel and scalarineqsel ) mcv's should be the values that the histogram can't deal with effectively. ie, there's no ordering relation, there are too many values to fit into a histogram bucket, the histogram eqsel estimate does an especially bad job for a relatively common value, etc. Even now, if there are 100 histogram buckets then any values that occupy more than 1% of the table will be mcv's regardless - why force a value to be an mcv if it only occupies 0.1% of the table? That just bloats pg_statistic and slows down joins unnecessarily. I'll submit a patch for 8.5 and then, hopefully, some simple benchmarks can make my point.. -Nathan
В списке pgsql-hackers по дате отправления: