Re: More thoughts about planner's cost estimates

Поиск
Список
Период
Сортировка
От Greg Stark
Тема Re: More thoughts about planner's cost estimates
Дата
Msg-id 8764jkhgwb.fsf@stark.xeocode.com
обсуждение исходный текст
Ответ на Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
Ответы Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
Re: More thoughts about planner's cost estimates  ("Jim C. Nasby" <jnasby@pervasive.com>)
Список pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> 1. n-distinct estimation is bad, as previously discussed;
> 
> 2. our current heuristics sampling methods prevent us from sampling more than
> 0.5% of any reasonably large table, causing all statistics on those tables to
> be off for any table with irregular distribution of values;

I'm convinced these two are more connected than you believe. For distributions
of values in a range using small samples is on solid statistical basis. It's
the same reason poll results based on only a few hundred people can accurately
predict elections in a country with hundreds of millions of voters.

However for things like estimating discrete values there's absolutely no solid
foundation for these small samples. Your accuracy is going to be simply
proportional to the sample size, meaning you can't get anything trustworthy
without sampling large enough portions of the table that a full sequential
scan would be just as fast.

I might be interested in implementing that algorithm that was posted a while
back that involved generating good unbiased samples of discrete values. The
algorithm was quite clever and well described and paper claimed impressively
good results.

However it will only make sense if people are willing to accept that analyze
will need a full table scan -- at least for tables where the DBA knows that
good n_distinct estimates are necessary.

> 3. We don't have any method to analyze inter-column correlation within a table;
> 
> 4. We don't keep statistics on foriegn key correlation;

Gosh these would be nice but they sound like hard problems. Has anybody even
made any headway in brainstorming how to tackle them?

> 5. random_page_cost (as previously discussed) is actually a funciton of
> relatively immutable hardware statistics, and as such should not need to exist
> as a GUC once the cost model is fixed.

I don't think that's true at all. Not all hardware is the same.

Certainly the need to twiddle this GUC should be drastically reduced if the
cache effects are modelled properly and the only excuses left are legitimate
hardware differences.

-- 
greg



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tzahi Fadida
Дата:
Сообщение: Re: CTID issues and a soc student in need of help
Следующее
От: Robert Treat
Дата:
Сообщение: stable snapshot looks outdated