Обсуждение: generate_series woes
I think there's something sub-optimal with generate_series. In the following, "documents" is a table with more than 120000 rows, vacuumed and analyzed before the queries. EXPLAIN ANALYZE SELECT count (d.id), floor (s.val / 5000) FROM generate_series (1::INT, 5009) AS s (val) LEFT JOIN documents d ON d.id = s.val GROUP BY 2 ORDER BY 2; This returns: Sort (cost=4231.52..4232.02 rows=200 width=8) (actual time=41.886..41.887 rows=2 loops=1) Sort Key: (floor(((s.val / 5000))::double precision)) Sort Method: quicksort Memory: 25kB -> HashAggregate (cost=4219.88..4223.88 rows=200 width=8) (actual time=41.843..41.846 rows=2 loops=1) -> Nested Loop Left Join (cost=0.00..4214.88 rows=1000 width=8) (actual time=1.274..38.193 rows=5009 loops=1) -> Function Scan on generate_series s (cost=0.00..12.50 rows=1000 width=4) (actual time=1.209..3.102 rows=5009loops=1) -> Index Scan using documents_pkey on documents d (cost=0.00..4.18 rows=1 width=4) (actual time=0.004..0.005rows=1 loops=5009) Index Cond: (d.id = s.val) Total runtime: 42.218 ms Now let's wrap generate_series into an SQL function: CREATE FUNCTION genser (int, int) RETURNS SETOF int AS $$ SELECT * FROM generate_series ($1, $2) AS g(x); $$ LANGUAGE sql; EXPLAIN ANALYZE SELECT count (d.id), floor (s.val / 5000) FROM genser (1::INT, 5009) AS s (val) LEFT JOIN documents d ON d.id = s.val GROUP BY 2 ORDER BY 2; Not surprisingly, this returns the same plan: Sort (cost=4479.02..4479.52 rows=200 width=8) (actual time=43.606..43.607 rows=2 loops=1) Sort Key: (floor(((s.val / 5000))::double precision)) Sort Method: quicksort Memory: 25kB -> HashAggregate (cost=4467.38..4471.38 rows=200 width=8) (actual time=43.559..43.561 rows=2 loops=1) -> Nested Loop Left Join (cost=0.00..4462.38 rows=1000 width=8) (actual time=3.564..39.740 rows=5009 loops=1) -> Function Scan on genser s (cost=0.00..260.00 rows=1000 width=4) (actual time=3.503..5.435 rows=5009 loops=1) -> Index Scan using documents_pkey on documents d (cost=0.00..4.18 rows=1 width=4) (actual time=0.004..0.005rows=1 loops=5009) Index Cond: (d.id = s.val) Total runtime: 44.047 ms (9 rows) But look what happens if we tell PostgreSQL how many rows "genser" will return: CREATE FUNCTION genser (int, int) RETURNS SETOF int AS $$ SELECT * FROM generate_series ($1, $2) AS g(x); $$ LANGUAGE sql ROWS 5009; EXPLAIN ANALYZE SELECT count (d.id), floor (s.val / 5000) FROM genser (1::INT, 5009) AS s (val) LEFT JOIN documents d ON d.id = s.val GROUP BY 2 ORDER BY 2; Now we get a better plan: Sort (cost=15545.54..15546.04 rows=200 width=8) (actual time=27.857..27.859 rows=2 loops=1) Sort Key: (floor(((s.val / 5000))::double precision)) Sort Method: quicksort Memory: 25kB -> HashAggregate (cost=15533.89..15537.89 rows=200 width=8) (actual time=27.817..27.819 rows=2 loops=1) -> Merge Right Join (cost=1610.15..15508.85 rows=5009 width=8) (actual time=7.714..24.133 rows=5009 loops=1) Merge Cond: (d.id = s.val) -> Index Scan using documents_pkey on documents d (cost=0.00..13472.20 rows=125518 width=4) (actual time=0.045..6.112rows=5010 loops=1) -> Sort (cost=1610.15..1622.67 rows=5009 width=4) (actual time=7.651..9.501 rows=5009 loops=1) Sort Key: s.val Sort Method: quicksort Memory: 427kB -> Function Scan on genser s (cost=0.00..1302.34 rows=5009 width=4) (actual time=3.559..5.262 rows=5009loops=1) Total runtime: 28.445 ms (12 rows) Since generate_series is a builtin function, can't it tell how many rows it will return?
On Mon, Apr 14, 2008 at 5:21 AM, Harald Fuchs <hari.fuchs@googlemail.com> wrote: > I think there's something sub-optimal with generate_series. > In the following, "documents" is a table with more than 120000 rows, > vacuumed and analyzed before the queries. everything is working exactly as intended. while it's obvious to you that the generate series function returns a particular number of rows based on your supplied inputs, it's not (yet) obvious to the planner. your genser function supplies the hint the planner needs and it adjusts the plan. most set returning functions (particularly non-immutable ones) are not so easy to determine the # of rows from the input parameters anyways. merlin
In article <b42b73150804150715r83cad1doa166230ec509f0d@mail.gmail.com>, "Merlin Moncure" <mmoncure@gmail.com> writes: > On Mon, Apr 14, 2008 at 5:21 AM, Harald Fuchs <hari.fuchs@googlemail.com> wrote: >> I think there's something sub-optimal with generate_series. >> In the following, "documents" is a table with more than 120000 rows, >> vacuumed and analyzed before the queries. > everything is working exactly as intended. while it's obvious to you > that the generate series function returns a particular number of rows > based on your supplied inputs, it's not (yet) obvious to the planner. Which was exactly my point. Since generate_series is a builtin function, the planner could theoretically know the number of rows returned, thus choosing a better plan. OTOH, the difference between theory and reality is in theory smaller than in reality. > your genser function supplies the hint the planner needs and it > adjusts the plan. most set returning functions (particularly > non-immutable ones) are not so easy to determine the # of rows from > the input parameters anyways. Yes, of course. I used "genser" just to show that there is a better plan.
On Mon, Apr 14, 2008 at 11:21:58AM +0200, Harald Fuchs wrote: > I think there's something sub-optimal with generate_series. > In the following, "documents" is a table with more than 120000 rows, > vacuumed and analyzed before the queries. > Since generate_series is a builtin function, can't it tell how many > rows it will return? i think it would be better off not to limit some functionality for builtin functions. it would be much nicer to have the ability to hint planer about rowcount from function *in* the sql. something like: select i from generate_series(1,10) {hint:10} as i; i'm not proposiung syntax. i'm suggesting the functionality. depesz -- quicksil1er: "postgres is excellent, but like any DB it requires a highly paid DBA. here's my CV!" :) http://www.depesz.com/ - blog dla ciebie (i moje CV)
hubert depesz lubaczewski <depesz@depesz.com> writes: > i think it would be better off not to limit some functionality for > builtin functions. it would be much nicer to have the ability to hint > planer about rowcount from function *in* the sql. > > something like: > > select i from generate_series(1,10) {hint:10} as i; > > i'm not proposiung syntax. i'm suggesting the functionality. I'm strongly declined for such non-SQL compliant solutions. I'd be appreciated if hackers can solve the problem internally, without bugging SQL syntax. Regards.
On Wed, Apr 16, 2008 at 03:37:22PM +0300, Volkan YAZICI wrote: > I'm strongly declined for such non-SQL compliant solutions. I'd be > appreciated if hackers can solve the problem internally, without bugging > SQL syntax. for generate_series - sure. but i have functions which change (in a known way) number of returned rows based on their arguments. depesz -- quicksil1er: "postgres is excellent, but like any DB it requires a highly paid DBA. here's my CV!" :) http://www.depesz.com/ - blog dla ciebie (i moje CV)
On Wed, Apr 16, 2008 at 8:37 AM, Volkan YAZICI <yazicivo@ttmail.com> wrote: > hubert depesz lubaczewski <depesz@depesz.com> writes: > > i think it would be better off not to limit some functionality for > > builtin functions. it would be much nicer to have the ability to hint > > planer about rowcount from function *in* the sql. > > > > something like: > > > > select i from generate_series(1,10) {hint:10} as i; > > > > i'm not proposiung syntax. i'm suggesting the functionality. > > I'm strongly declined for such non-SQL compliant solutions. I'd be > appreciated if hackers can solve the problem internally, without bugging > SQL syntax. maybe -- just an idle thought -- the 'ROWS' clause of create function could be expanded to take a simple expression based on the input parameters. merlin
On Wed, Apr 16, 2008 at 09:01:10AM -0400, Merlin Moncure wrote: > On Wed, Apr 16, 2008 at 8:37 AM, Volkan YAZICI <yazicivo@ttmail.com> wrote: > > hubert depesz lubaczewski <depesz@depesz.com> writes: > > > select i from generate_series(1,10) {hint:10} as i; > > > > > > i'm not proposiung syntax. i'm suggesting the functionality. > > > > I'm strongly declined for such non-SQL compliant solutions. I'd be > > appreciated if hackers can solve the problem internally, without bugging > > SQL syntax. > > maybe -- just an idle thought -- the 'ROWS' clause of create function > could be expanded to take a simple expression based on the input > parameters. In computer science terms, I think you mean that you want something known as dependant types. Full dependant types are probably overkill for PG, but some very limited form would be a fun project. The idea, in general, is to move code into the typesystem that is run during typechecking and used to prove that your code is doing the right thing. Within PG, this seems to naturally (from my naive viewpoint) extend to the case of providing hints (proofs would be the normal use) to the planner about what's going to happen when the code is run. This seems to imply that types couldn't be stored as OIDs any more (you'd be creating and destroying lots while planning the query) so would probably change the structure of the code rather significantly. Sam