Re: Very specialised query
From        | Matthew Wakeling
Subject     | Re: Very specialised query
Date        |
Msg-id      | alpine.DEB.2.00.0903301618190.21772@aragorn.flymine.org
In reply to | Re: Very specialised query  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses   | Re: Very specialised query  (Віталій Тимчишин <tivv00@gmail.com>)
List        | pgsql-performance
On Fri, 27 Mar 2009, Tom Lane wrote:
>> Shouldn't Postgres favour a "between" index scan over an open-ended
>> one?
>
> Currently the planner only notices that for a range check that involves
> comparisons of the same variable expression to two constants (or
> pseudoconstants anyway). In principle it might be reasonable to have a
> heuristic that reduces the estimated selectivity in the example above,
> but it looks to me like it'd make clauselist_selectivity() a lot slower
> and more complicated. When you see (l1.end > l2.start), how do you know
> which variable to try to match up against others? And if you try to
> match both, what do you do when you get matches for both?

Arguably, having multiple "greater than" constraints on a field should
not affect the selectivity much, because the separate constraints will
all be throwing away the same set of rows. So a single "greater than"
will halve the number of rows, two "greater than"s will divide the
number of rows by three, and so on. That's a vast approximation of
course, and should be skewed by the statistics.

However, combining a "greater than" with a "less than" has a different
effect on selectivity. If the values were completely random, with an
identical spread in each field, then a single "greater than" would halve
the number of rows and an additional "less than" would halve it again.
In most real-life situations, though, the values will not be completely
random but very skewed, as in our case. Unless the planner starts
collecting statistics on the distribution of the difference between two
fields, that will be hard to account for.
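The "divide by k+1" heuristic for k independent "greater than" constraints is easy to check with a quick Monte Carlo sketch (my own illustration, not from the thread, assuming uniformly distributed independent values):

```python
import random

random.seed(42)
N = 200_000

def survivors_gt(k):
    """Fraction of rows satisfying k constraints x > c_i, where x and
    each c_i are independent uniform values on [0, 1). Matthew's
    heuristic predicts roughly 1/(k+1): x must exceed the max of the
    constants, i.e. be the largest of k+1 iid draws."""
    hits = 0
    for _ in range(N):
        x = random.random()
        if all(x > random.random() for _ in range(k)):
            hits += 1
    return hits / N

one_gt = survivors_gt(1)   # about 1/2
two_gt = survivors_gt(2)   # about 1/3, not 1/4 as independent clauses suggest
print(one_gt, two_gt)
```

A planner that multiplies clause selectivities independently would estimate 1/4 for two "greater than"s; the heuristic in the text (1/3) is closer on data like this, which is the point being argued.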
Have a look at the following EXPLAIN ANALYSE:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start
    AND l1.start < l2.start
UNION ALL
SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start
    AND l1.start >= l2.start;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..20479179.74 rows=138732882 width=8)
         (actual time=0.055..2362748.298 rows=258210 loops=1)
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                    (actual time=0.053..627.038 rows=99436 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1
               (cost=0.00..55966.07 rows=43277 width=12)
               (actual time=0.025..58.604 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2
               (cost=0.00..123.10 rows=4809 width=12)
               (actual time=0.003..0.006 rows=2 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
   ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                    (actual time=0.041..2361632.540 rows=158774 loops=1)
         Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
         ->  Index Scan using location_test_obj_end on location l1
               (cost=0.00..55966.07 rows=43277 width=12)
               (actual time=0.009..80.814 rows=43534 loops=1)
               Index Cond: (objectid = 228000093)
         ->  Index Scan using location_test_obj_start on location l2
               (cost=0.00..123.10 rows=4809 width=12)
               (actual time=0.012..29.823 rows=21768 loops=43534)
               Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
 Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance.
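As a sanity check on why the two branches behave so differently, here is a small model of my own (not from the thread): treat the (objectid, start) index as a sorted array of start values and count how many entries each branch's Index Cond forces the inner scan to touch per outer row. The row count matches the plan; the interval lengths and value spread are assumptions.

```python
import bisect
import random

random.seed(1)
n = 43_534                             # rows for objectid 228000093, per the plan
starts = sorted(random.uniform(0, 1_000_000) for _ in range(n))
# short intervals, as with genomic-style location data (assumed shape)
intervals = [(s, s + random.uniform(1, 100)) for s in starts]

def scanned_branch1(l1_start, l1_end):
    """Branch 1 Index Cond: l1.start < l2.start < l1.end -- bounded both ends,
    so the scan touches only the narrow slice between the two values."""
    return bisect.bisect_left(starts, l1_end) - bisect.bisect_right(starts, l1_start)

def scanned_branch2(l1_start, l1_end):
    """Branch 2 Index Cond: l2.start <= l1.start (and l2.start < l1.end, which
    is implied) -- open-ended below, so the scan reads the index from the
    very beginning every time."""
    return bisect.bisect_right(starts, l1_start)

avg1 = sum(scanned_branch1(s, e) for s, e in intervals) / n
avg2 = sum(scanned_branch2(s, e) for s, e in intervals) / n
print(avg1, avg2)   # avg1 is tiny; avg2 is roughly n/2
```

This reproduces the shape of the plan: the bounded branch touches a handful of rows per loop (rows=2 in the plan), while the open-ended branch averages about half the table per loop (rows=21768 in the plan), which is where the forty minutes go.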
On one of them, both ends of the range are constrained; the other is
open-ended:

  Expected rows   Seen rows
           4809           2   (bounded)
           4809       21768   (open-ended)

The first query fragment takes 700ms to execute, and the second takes
about forty minutes. It is effectively doing a nested loop join with
hardly any benefit gained from the indexes at all, over a simple index
on objectid. I may as well make the query a lot simpler, and do:

SELECT l1.id AS id1, l2.id AS id2
FROM location l1, location l2
WHERE l1.objectid = 228000093
    AND l2.objectid = 228000093
    AND l1.id <> l2.id
    AND l1.start < l2.end
    AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think
this style of query is going to work for us. I know that the database
could in theory return a result in about 1400ms. I'll see how close to
that I can get with plpgsql.

Matthew

--
First law of computing: Anything can go wro
sig: Segmentation fault.  core dumped.
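The plpgsql approach isn't shown in the thread, but the theoretical ~1400ms corresponds to a single sweep over the rows sorted by start, rather than a nested loop. A minimal Python sketch of that sweep-line idea (my own reconstruction; names and data shapes are assumptions):

```python
import heapq

def overlapping_pairs(intervals):
    """intervals: list of (id, start, end) for one objectid.
    Returns the set of (id1, id2) pairs, in both orders, where
    start1 < end2 AND end1 > start2 AND id1 != id2 -- the same overlap
    condition as the SQL query.

    Sweep over the intervals in start order, keeping a heap of 'active'
    intervals ordered by end. Every interval still active when a new one
    arrives must overlap it, so no per-row index probe is needed:
    O(n log n + output) instead of a nested loop."""
    pairs = set()
    active = []                          # heap of (end, id)
    for iid, start, end in sorted(intervals, key=lambda t: t[1]):
        # intervals ending at or before this start can never overlap again
        while active and active[0][0] <= start:
            heapq.heappop(active)
        for _a_end, a_id in active:      # all survivors overlap the new one
            pairs.add((a_id, iid))
            pairs.add((iid, a_id))
        heapq.heappush(active, (end, iid))
    return pairs
```

A plpgsql version of the same idea would read the rows once via `FOR r IN SELECT ... ORDER BY start LOOP`, maintain the active set in memory, and `RETURN NEXT` each pair, doing one index scan total instead of one per outer row.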