Re: TABLESAMPLE patch

Поиск
Список
Период
Сортировка
От Petr Jelinek
Тема Re: TABLESAMPLE patch
Дата
Msg-id 5497DF3C.60604@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: TABLESAMPLE patch  (Tomas Vondra <tv@fuzzy.cz>)
Ответы Re: TABLESAMPLE patch
Список pgsql-hackers
On 21/12/14 18:38, Tomas Vondra wrote:
> Hi,
>
> On 18.12.2014 13:14, Petr Jelinek wrote:
>> Hi,
>>
>> v2 version of this patch is attached.
>
> I did a review of this v2 patch today. I plan to do a bit more testing,
> but these are my comments/questions so far:

Thanks for looking at it!

>
> (0) There's a TABLESAMPLE page at the wiki, not updated since 2012:
>
>      https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
>
>      We should either update it or mark it as obsolete I guess. Also,
>      I'd like to know what's the status regarding the TODO items
>      mentioned there. Are those still valid with this patch?

I will have to look in more detail over the holidays and update it, but 
the general info about table sampling there applies and will apply to 
any patch trying to implement it.

Quick look on the mail thread suggest that at least the concerns 
mentioned in the mailing list should not apply to this implementation.
And looking at the patch the problem with BERNOULLI sampling should not 
exist either as I use completely different implementation for that.

I do also have some issues with joins which I plan to look at but it's 
different one (my optimizer code overestimates the number of rows)

> (1) The patch adds a new catalog, but does not bump CATVERSION.
>

I thought this was always done by committer?

> (2) The catalog naming (pg_tablesamplemethod) seems a bit awkward,
>      as it squishes everything into a single chunk. That's inconsistent
>      with naming of the other catalogs. I think pg_table_sample_method
>      would be better.

Fair point, but perhaps pg_tablesample_method then as tablesample is 
used as single word everywhere including the standard.

>
> (3) There are a few more strange naming decisions, but that's mostly
>      because of the SQL standard requires that naming. I mean SYSTEM and
>      BERNOULLI method names, and also the fact that the probability is
>      specified as 0-100 value, which is inconsistent with other places
>      (e.g. percentile_cont uses the usual 0-1 probability notion). But
>      I don't think this can be fixed, that's what the standard says.

Yeah, I don't exactly love that either but what standard says, standard 
says.

>
> (4) I noticed there's an interesting extension in SQL Server, which
>      allows specifying PERCENT or ROWS, so you can say
>
>        SELECT * FROM table TABLESAMPLE SYSTEM (25 PERCENT);
>
>      or
>
>        SELECT * FROM table TABLESAMPLE SYSTEM (2500 ROWS);
>
>      That seems handy, and it'd make migration from SQL Server easier.
>      What do you think?

Well doing it exactly this way somewhat kills the extensibility which 
was one of the main goals for me - I allow any kind of parameters for 
sampling and the handling of those depends solely on the sampling 
method. This means that in my approach if you'd want to change what you 
are limiting you'd have to write new sampling method and the query would 
then look like SELECT * FROM table TABLESAMPLE SYSTEM_ROWLIMIT (2500); 
or some such (depending on how you name the sampling method). Or SELECT 
* FROM table TABLESAMPLE SYSTEM (2500, 'ROWS'); if we chose to extend 
the SYSTEM sampling method, that would be also doable.

The reason for this is that I don't want to really limit too much what 
parameters can be send to sampling (I envision even SYSTEM_TIMED 
sampling method that will get limit as time interval for example).

>
> (5) I envision a lot of confusion because of the REPEATABLE clause.
>      With READ COMMITTED, it's not really repeatable because of changes
>      done by the other users (and maybe things like autovacuum). Shall
>      we mention this in the documentation?

Yes docs need improvement and this should be mentioned, also code-docs - 
I found few places which I forgot to update when changing code and now 
have misleading comments.

>
> (6) This seems slightly wrong, because of long/uint32 mismatch:
>
>        long seed = PG_GETARG_UINT32(1);
>
>      I think uint32 would be more appropriate, no?

Probably, although I need long later in the algorithm anyway.

>
> (7) NITPICKING: I think a 'sample_rate' would be a better name here:
>
>        double percent = sampler->percent;

Ok.

>
> (8) NITPICKING: InitSamplingMethod contains a command with ';;'
>
>        fcinfo.arg[i] = (Datum) 0;;

Yeah :)

>
> (9) The current regression tests only use the REPEATABLE cases. I
>      understand queries without this clause are RANDOM, but maybe we
>      could do something like this:
>
>         SELECT COUNT(*) BETWEEN 5000 AND 7000 FROM (
>           SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50)
>         ) foo;
>
>       Granted, there's still a small probability of false positive, but
>       maybe that's sufficiently small? Or is the amount of code this
>       tests negligible?

Well, depending on fillfactor and limit it could be made quite reliable 
I think, I also want to add test with VIEW (I think v2 has a bug there) 
and perhaps some JOIN.

>
> (10) In the initial patch you mentioned it's possible to write custom
>       sampling methods. Do you think a CREATE TABLESAMPLE METHOD,
>       allowing custom methods implemented as extensions would be useful?
>

Yes, I think so, CREATE/DROP TABLESAMPLE METHOD is on my TODO, but since 
that's just simple mechanical work with no hard problems to solve there 
I can add it later once we have agreement on the general approach of the 
patch (especially in the terms of extensibility).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Commit Fest 2014-12, Status after week 1
Следующее
От: Heikki Linnakangas
Дата:
Сообщение: Re: GiST kNN search queue (Re: KNN-GiST with recheck)