Re: TABLESAMPLE patch
От | Tomas Vondra |
---|---|
Тема | Re: TABLESAMPLE patch |
Дата | |
Msg-id | 54981016.803@fuzzy.cz обсуждение исходный текст |
Ответ на | Re: TABLESAMPLE patch (Petr Jelinek <petr@2ndquadrant.com>) |
Список | pgsql-hackers |
On 22.12.2014 10:07, Petr Jelinek wrote: > On 21/12/14 18:38, Tomas Vondra wrote: >> >> (1) The patch adds a new catalog, but does not bump CATVERSION. >> > > I thought this was always done by committer? Right. Sorry for the noise. > >> (2) The catalog naming (pg_tablesamplemethod) seems a bit awkward, >> as it squishes everything into a single chunk. That's inconsistent >> with naming of the other catalogs. I think pg_table_sample_method >> would be better. > > Fair point, but perhaps pg_tablesample_method then as tablesample is > used as single word everywhere including the standard. Sounds good. >> (4) I noticed there's an interesting extension in SQL Server, which >> allows specifying PERCENT or ROWS, so you can say >> >> SELECT * FROM table TABLESAMPLE SYSTEM (25 PERCENT); >> >> or >> >> SELECT * FROM table TABLESAMPLE SYSTEM (2500 ROWS); >> >> That seems handy, and it'd make migration from SQL Server easier. >> What do you think? > > Well doing it exactly this way somewhat kills the extensibility which > was one of the main goals for me - I allow any kind of parameters for > sampling and the handling of those depends solely on the sampling > method. This means that in my approach if you'd want to change what you > are limiting you'd have to write new sampling method and the query would > then look like SELECT * FROM table TABLESAMPLE SYSTEM_ROWLIMIT (2500); > or some such (depending on how you name the sampling method). Or SELECT > * FROM table TABLESAMPLE SYSTEM (2500, 'ROWS'); if we chose to extend > the SYSTEM sampling method, that would be also doable. > > The reason for this is that I don't want to really limit too much what > parameters can be send to sampling (I envision even SYSTEM_TIMED > sampling method that will get limit as time interval for example). Good point. >> >> (6) This seems slightly wrong, because of long/uint32 mismatch: >> >> long seed = PG_GETARG_UINT32(1); >> >> I think uint32 would be more appropriate, no? > > Probably, although I need long later in the algorithm anyway. Really? ISTM the sampler_setseed() does not really require long, uint32 would work exactly the same. >> >> (9) The current regression tests only use the REPEATABLE cases. I >> understand queries without this clause are RANDOM, but maybe we >> could do something like this: >> >> SELECT COUNT(*) BETWEEN 5000 AND 7000 FROM ( >> SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50) >> ) foo; >> >> Granted, there's still a small probability of false positive, but >> maybe that's sufficiently small? Or is the amount of code this >> tests negligible? > > Well, depending on fillfactor and limit it could be made quite reliable > I think, I also want to add test with VIEW (I think v2 has a bug there) > and perhaps some JOIN. OK. >> >> (10) In the initial patch you mentioned it's possible to write custom >> sampling methods. Do you think a CREATE TABLESAMPLE METHOD, >> allowing custom methods implemented as extensions would be useful? >> > > Yes, I think so, CREATE/DROP TABLESAMPLE METHOD is on my TODO, but since > that's just simple mechanical work with no hard problems to solve there > I can add it later once we have agreement on the general approach of the > patch (especially in the terms of extensibility). OK, good to know. Tomas
В списке pgsql-hackers по дате отправления: