Re: Suggestions for analyze patch required...

From: Tom Lane
Subject: Re: Suggestions for analyze patch required...
Msg-id: 12756.1074017298@sss.pgh.pa.us
In reply to: Re: Suggestions for analyze patch required...  ("Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk>)
List: pgsql-hackers

"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> I agree that the custom function needs an input as to the number of rows
> used for analysis, but I think that this is determined by the
> application in question. It may be that while the existing algorithm is
> fine for the existing data types, it may not give accurate enough
> statistics for some custom type that someone will need to create in the
> future (e.g. the 300 * atstattarget estimate for minrows may not be
> valid in some cases).

Exactly.  That equation has to be part of the potentially-datatype-specific
code.  I believe the sampling code is already set up to take the max()
across all the requested sample size values.  The notion here is that if
we need to sample (say) 10000 rows instead of 3000 to satisfy some
particular analysis requirement, we might as well make use of the larger
sample size for all the columns.  You seem to be envisioning fetching a
new sample for each column of the table --- that seems like N times the
work for an N-column table, with little benefit that I can see.
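
The max() rule described above could be sketched roughly as follows. This is only an illustration, not the actual sampling code: ColumnRequest and target_sample_rows are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-column request; NOT the real VacAttrStats. */
typedef struct ColumnRequest
{
    const char *name;
    int         minrows;    /* rows this column's analysis wants sampled */
} ColumnRequest;

/*
 * Return the largest requested sample size, so that a single shared
 * sample satisfies every column.  Returns 0 when there are no columns.
 */
int
target_sample_rows(const ColumnRequest *cols, size_t ncols)
{
    int target = 0;

    for (size_t i = 0; i < ncols; i++)
    {
        assert(cols[i].minrows >= 0);
        if (cols[i].minrows > target)
            target = cols[i].minrows;
    }
    return target;
}
```

With one column asking for 10000 rows and the rest asking for 3000, the table is sampled once at 10000 rows and every column sees that sample.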

> 1) Modify examine_attribute() so it will return a VacAttrStats structure
> if the column has a valid ANALYZE function OID, and has not been
> dropped. Move all the specific functionality into a new function, assign
> it an OID, and make this the default for existing pg_types.

I was envisioning that the existing examine_attribute() would become the
default datatype-specific routine referenced in pg_type.  Either it, or
a substitute routine written by a datatype author, would be called and
would return a VacAttrStats structure (or NULL to skip analysis).  The
stats structure would indicate the requested sample size and contain a
function pointer to a second function to call back after the sample has
been collected.  The existing compute_xxx_stats functions would become
two examples of this second function.  (The second functions would thus
not require pg_proc entries nor a pg_type column to reference them: the
examine_attribute routine would know which function it wanted called.)
The second functions would return data to be stored into pg_statistic,
using the VacAttrStats structures.
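
A minimal sketch of that two-phase arrangement, under loud assumptions: SketchAttrStats, sketch_examine_attribute and compute_mean_stats are made-up names, the struct is far smaller than any real VacAttrStats, and the "statistic" computed is just a mean so the example stays short.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SketchAttrStats SketchAttrStats;

/* Phase 2 callback: invoked after the shared sample has been collected. */
typedef void (*ComputeStatsFn) (SketchAttrStats *stats,
                                const double *sample, int nrows);

/* Hypothetical stand-in for VacAttrStats. */
struct SketchAttrStats
{
    int            minrows;        /* requested sample size */
    ComputeStatsFn compute_stats;  /* chosen by the phase 1 routine */
    int            rows_seen;      /* filled in by phase 2 */
    double         result_mean;    /* filled in by phase 2 */
};

/* One possible phase 2 function; it only fills in the stats struct. */
void
compute_mean_stats(SketchAttrStats *stats, const double *sample, int nrows)
{
    double sum = 0.0;

    assert(stats != NULL);
    for (int i = 0; i < nrows; i++)
        sum += sample[i];
    stats->rows_seen = nrows;
    stats->result_mean = (nrows > 0) ? sum / nrows : 0.0;
}

/*
 * Phase 1: decide whether to analyze the column.  Returns NULL to skip it,
 * otherwise fills in the sample size request and the callback pointer.
 */
SketchAttrStats *
sketch_examine_attribute(SketchAttrStats *storage, int stattarget,
                         int is_dropped)
{
    if (is_dropped || stattarget <= 0)
        return NULL;
    storage->minrows = 300 * stattarget;    /* the default equation */
    storage->compute_stats = compute_mean_stats;
    storage->rows_seen = 0;
    storage->result_mean = 0.0;
    return storage;
}
```

In this sketch the caller, not the per-datatype code, would collect the sample and later write the results out; a datatype author would replace only the phase 1 routine and supply a different callback.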

IMHO neither acquisition of the sample rows nor storing of the final
results in pg_statistic should be under the control of the per-datatype
routine, because those are best done in parallel for all columns at
once.

> Finally the way VacAttrStats is defined means that the float * arrays
> are fixed at STATISTIC_NUM_SLOTS elements. For example, what if I want a
> histogram with more than 1000 buckets???

The histogram is still just one array, no?  NUM_SLOTS defines the
maximum number of different arrays you can put into pg_statistic, but
not their dimensions or contents.  I don't see your point.
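
To illustrate the distinction, here is a rough sketch assuming a struct shaped loosely like the slot arrays in question. SketchSlots, SKETCH_NUM_SLOTS and store_histogram are hypothetical names, and the slot count of 4 is an assumption made for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NUM_SLOTS 4      /* assumed value, for illustration only */

/* The slot count is fixed; the length of each slot's array is not. */
typedef struct SketchSlots
{
    int    numnumbers[SKETCH_NUM_SLOTS];   /* length of each array */
    float *stanumbers[SKETCH_NUM_SLOTS];   /* separately allocated arrays */
} SketchSlots;

/*
 * Put a histogram with nbuckets entries into one slot.
 * Returns 0 on success, -1 if allocation fails.
 */
int
store_histogram(SketchSlots *s, int slot, int nbuckets)
{
    float *h;

    assert(slot >= 0 && slot < SKETCH_NUM_SLOTS);
    assert(nbuckets > 0);
    h = calloc((size_t) nbuckets, sizeof(float));
    if (h == NULL)
        return -1;
    s->stanumbers[slot] = h;
    s->numnumbers[slot] = nbuckets;
    return 0;
}
```

A 5000-bucket histogram occupies a single slot here, leaving the other slots free for different kinds of statistics.
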
        regards, tom lane

