Re: [HACKERS] PATCH: multivariate histograms and MCV lists
From: Tomas Vondra
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Msg-id: 8ac8bd94-478d-215d-e6bd-339f1f20a74c@2ndquadrant.com
In reply to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Dean Rasheed <dean.a.rasheed@gmail.com>)
           Re: [HACKERS] PATCH: multivariate histograms and MCV lists (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
Hi,

Attached is an updated version of the patch series, adopting a couple of improvements, both for MCV lists and histograms.

MCV
---

For the MCV list part, I've adopted the approach proposed by Dean, computing a base selectivity and using it to correct the non-MCV part. I agree the simplicity of the approach is a nice feature, and it seems to produce better estimates. I'm not sure I understand the approach perfectly, but I've tried to add comments explaining how it works.

I've also changed how we build the MCV lists, particularly how we decide how many / which items to store in the MCV list. In the previous version I adopted the same algorithm we use for per-column MCV lists, but in certain cases that turned out to be too restrictive.

Consider for example a table with multiple perfectly correlated columns, with very few combinations. That is, something like this:

    CREATE TABLE t (a int, b int);

    INSERT INTO t SELECT mod(i,50), mod(i,50)
      FROM generate_series(1,1e6) s(i);

    CREATE STATISTICS s (mcv) ON a,b FROM t;

Now, the data distribution is very simple: uniform, with 50 distinct combinations, each representing 2% of the data (and the random sample should be pretty close to that).

In these cases, analyze_mcv_list decides it does not need any MCV list, because the frequency of each value is pretty much 1/ndistinct. For a single column that's reasonable, but for multiple correlated columns it's rather problematic. We might use the same ndistinct approach (assuming we have the ndistinct coefficients), but that still does not allow us to decide which combinations are "valid" with respect to the data. For example, we can't decide that (1,10) does not appear in the data.

So I'm not entirely sure adopting the same analyze_mcv_list algorithm for both single-column and multi-column stats is the right thing to do. It may make sense to keep more items in the multi-column case, for reasons that are not really valid for single-column statistics.
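To make the base-selectivity correction easier to follow, here is a minimal Python sketch of the idea. The function name, the (frequency, base_frequency) item layout, and the calling convention are all illustrative, not the actual PostgreSQL code: each MCV item carries its observed frequency plus the frequency the independence assumption would predict, and the non-MCV remainder is derived from the per-column ("simple") estimate.

```python
# Hedged sketch of the base-selectivity correction for multivariate MCV
# lists. Names and data layout are illustrative, not the real C code.
#
# Each MCV item is a pair:
#   frequency      - observed frequency of the combination in the sample
#   base_frequency - frequency predicted by assuming the columns are
#                    independent (product of per-column selectivities)

def mcv_clauselist_selectivity(items, matches, simple_sel):
    """items: list of (frequency, base_frequency) pairs;
    matches: bools saying which items satisfy the clauses;
    simple_sel: estimate obtained from per-column statistics alone."""
    mcv_sel = sum(f for (f, b), m in zip(items, matches) if m)
    base_sel = sum(b for (f, b), m in zip(items, matches) if m)
    # Correct the non-MCV part: start from the independence-based
    # estimate, remove the portion already covered by matching MCV
    # items (at their base frequency), then add their observed
    # frequency back in.
    other_sel = max(simple_sel - base_sel, 0.0)
    return min(mcv_sel + other_sel, 1.0)
```

For perfectly correlated columns the observed frequency of a matching item is much larger than its base frequency, so the correction pulls the estimate up; for combinations absent from the MCV list, only the (scaled-down) simple estimate remains.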
For now I've added a trivial condition to simply keep all the groups when possible. This probably needs more thought.

BTW Dean's patch also modified how the maximum number of items in an MCV list is determined: instead of the shaky defaults I used before, it derives the size from the attstattarget values of the columns, keeping the maximum value. That seems to make sense, so I've kept it.

histograms
----------

For histograms, I've made the two improvements I mentioned previously.

Firstly, simple equality conditions (of the form "var = const") are estimated as 1/ndistinct (possibly using ndistinct coefficients when available), and then used only as "conditions" (in the "conditional probability" sense) when estimating the rest of the clauses using the histogram. That is, P(clauses) is split into two parts

    P(clauses) = P(equalities) * P(remaining|equalities)

where the first part is estimated as 1/ndistinct and the second part is estimated using the histogram. I'm sure this needs more thought, particularly when combining MCV and histogram estimates, but in general it seems to work quite nicely.

The second improvement concerns estimating what fraction of a bucket matches the conditions. Instead of using the rough 1/2-bucket estimate, I've adopted the convert_to_scalar approach, computing a geometric mean over all the clauses (at the bucket level). I'm not entirely sure the geometric mean is the right approach (or better than simply using 1/2 of the bucket), because multiplying the per-clause frequencies is essentially an assumption of independence at the bucket level. That is rather incompatible with the purpose of multi-column statistics, which are meant to be used exactly when the columns are not independent.

measurements
------------

I think we need to maintain a set of tests (dataset + query), so that we can compare the impact of various changes to the algorithm. So far we've used mostly ad-hoc queries, often created as counter-examples, and that does not seem very practical.
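The per-bucket estimate above can be sketched in a few lines of Python. This is a hedged illustration, not the patch's actual code: the function name is made up, and the per-clause fractions are assumed to come from convert_to_scalar-style interpolation over the bucket's range.

```python
# Hedged sketch: estimate what fraction of a histogram bucket matches a
# set of range clauses by taking the geometric mean of the per-clause
# fractions, instead of assuming a flat 1/2 of the bucket.

def bucket_match_fraction(clause_fractions):
    """clause_fractions: for each clause, the fraction of the bucket's
    range (via convert_to_scalar-style interpolation) satisfying that
    clause, each in [0, 1]."""
    if not clause_fractions:
        return 1.0  # no clauses constrain the bucket
    product = 1.0
    for f in clause_fractions:
        product *= f
    # Geometric mean; the raw product alone would amount to assuming
    # the clauses are independent within the bucket, which is exactly
    # the assumption multi-column statistics try to avoid.
    return product ** (1.0 / len(clause_fractions))
```

With one clause covering a quarter of the bucket and another covering it fully, the geometric mean gives sqrt(0.25 * 1.0) = 0.5, while the raw product (independence) would give 0.25.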
So I'm attaching a simple SQL script that I consider an initial version of that. It has a couple of synthetic data sets, and queries estimated with and without extended statistics.

I'm also attaching a spreadsheet with results for (a) the original version of the patch series, as submitted on 6/24, (b) the new version attached here, and (c) the new version using the per-bucket estimates directly, without the geometric mean.

Overall, the new versions seem to perform better than the version from 6/24, and also better than per-column statistics alone. There are cases where extended statistics produce over-estimates, but I find that somewhat natural, due to the lower resolution of the multi-column stats.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services