Re: Very specialised query

Список
Период
Сортировка
От Matthew Wakeling
Тема Re: Very specialised query
Дата
Msg-id alpine.DEB.2.00.0903301618190.21772@aragorn.flymine.org
обсуждение исходный текст
Ответ на Re: Very specialised query  (Tom Lane)
Ответы Re: Very specialised query  (Віталій Тимчишин)
Список pgsql-performance
Дерево обсуждения
Very specialised query  (Matthew Wakeling, )
 Re: Very specialised query  ("Kevin Grittner", )
 Re: Very specialised query  (Tom Lane, )
  Re: Very specialised query  (Matthew Wakeling, )
 Re: Very specialised query  (Matthew Wakeling, )
  Re: Very specialised query  (Tom Lane, )
 Re: Very specialised query  (Віталій Тимчишин, )
  Re: Very specialised query  (Matthew Wakeling, )
   Re: Very specialised query  (Tom Lane, )
    Re: Very specialised query  (Matthew Wakeling, )
    Re: Very specialised query  (Matthew Wakeling, )
     Re: Very specialised query  (Віталій Тимчишин, )
      Re: Very specialised query  (Matthew Wakeling, )
       Re: Very specialised query  (Віталій Тимчишин, )
        Re: Very specialised query  (Matthew Wakeling, )
 Re: Very specialised query  (Dimitri Fontaine, )
  Re: Very specialised query  (Matthew Wakeling, )
 Re: Very specialised query  ("Marc Mamin", )
  Re: Very specialised query  (Matthew Wakeling, )
   Re: Very specialised query  ("Marc Mamin", )
    Re: Very specialised query  (Matthew Wakeling, )
 Re: Very specialised query  (Matthew Wakeling, )
  Re: Very specialised query  (Віталій Тимчишин, )
   Re: Very specialised query  (Matthew Wakeling, )
    Re: Very specialised query  (Matthew Wakeling, )
     Re: Very specialised query  (Matthew Wakeling, )
      Re: Very specialised query  (Craig Ringer, )
 Re: Very specialised query  ("Marc Mamin", )
  Re: Very specialised query  (Matthew Wakeling, )
>> Shouldn't Postgres favour a "between" index scan over an open-ended
>> one?

On Fri, 27 Mar 2009, Tom Lane wrote:
> Currently the planner only notices that for a range check that involves
> comparisons of the same variable expression to two constants (or
> pseudoconstants anyway).  In principle it might be reasonable to have a
> heuristic that reduces the estimated selectivity in the example above,
> but it looks to me like it'd make clauselist_selectivity() a lot slower
> and more complicated.  When you see (l1.end > l2.start), how do you know
> which variable to try to match up against others?  And if you try to
> match both, what do you do when you get matches for both?

Arguably, having multiple "greater than" constraints on a field should not
affect the selectivity much, because the separate constraints will all be
throwing away the same set of rows. So having a single "greater than" will
halve the number of rows, while two "greater than"s will divide the number
of rows by three, etc. That's a vast approximation of course, and should
be skewed by the statistics.

However, combining a "greater than" with a "less than" has a different
effect on selectivity. If the numbers were completely random with
identical value spread in each field, then a single "greater than" would
halve the number of rows and an additional "less than" would halve the
number again. However, in most real life situations the values will not be
completely random, but will be very skewed, as in our case. Unless the
planner starts collecting statistics on the standard distribution of the
difference between two fields, that will be hard to account for though.

Have a look at the following EXPLAIN ANALYSE:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start
     AND l1.start < l2.start
UNION ALL
SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start
     AND l1.start >= l2.start;

                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------
  Append  (cost=0.00..20479179.74 rows=138732882 width=8)
          (actual time=0.055..2362748.298 rows=258210 loops=1)
    ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                     (actual time=0.053..627.038 rows=99436 loops=1)
          Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
          ->  Index Scan using location_test_obj_end on location l1
                          (cost=0.00..55966.07 rows=43277 width=12)
                          (actual time=0.025..58.604 rows=43534 loops=1)
                Index Cond: (objectid = 228000093)
          ->  Index Scan using location_test_obj_start on location l2
                          (cost=0.00..123.10 rows=4809 width=12)
                          (actual time=0.003..0.006 rows=2 loops=43534)
                Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
    ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                     (actual time=0.041..2361632.540 rows=158774 loops=1)
          Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
          ->  Index Scan using location_test_obj_end on location l1
                          (cost=0.00..55966.07 rows=43277 width=12)
                          (actual time=0.009..80.814 rows=43534 loops=1)
                Index Cond: (objectid = 228000093)
          ->  Index Scan using location_test_obj_start on location l2
                          (cost=0.00..123.10 rows=4809 width=12)
                          (actual time=0.012..29.823 rows=21768 loops=43534)
                Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
  Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance. On
one of them, there is a difference, and the other is open ended.

                Expected rows     Seen rows
difference       4809                  2
open-ended       4809              21768

The first query fragment takes 700ms to execute, and the second takes
about forty minutes. It is effectively doing a nested loop join with
hardly any benefit gained from the indexes at all, over a simple index on
objectid. I may as well make the query a lot simpler, and do:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think
this particular style of query is going to work for us. I know that the
database could in theory return a result in about 1400ms. I'll see how
close to that I can get with plpgsql.

Matthew

--
 First law of computing:  Anything can go wro
 sig: Segmentation fault.  core dumped.

В списке pgsql-performance по дате отправления:

Предыдущее
От: Scott Carey
Дата:
Сообщение: Re: Trying to track down weird query stalls
Следующее
От: Matthew Wakeling
Дата:
Сообщение: Re: Very specialised query