Re: Subselect query enhancement

Поиск
Список
Период
Сортировка
От Richard Huxton
Тема Re: Subselect query enhancement
Дата
Msg-id 45C309AE.1030506@archonet.com
обсуждение исходный текст
Ответ на Re: Subselect query enhancement  ("Michael Artz" <mlartz@gmail.com>)
Список pgsql-performance
Michael Artz wrote:
> Here are some numbers for 3 different queries using a very selective
> query (port = 2222).  I'm thinking that, since the row estimates are
> different from the actuals (2 vs 2000), that this particular port
> didn't make it into the statistics ... is that true?  Does this
> matter?  If this isn't enough data, I can type up some more.

SELECT * FROM pg_stats WHERE tablename='foo';
This will show you frequency-stats on each column (as generated by
analyse). You're interested in n_distinct, most_common_vals,
most_common_freqs.

> One thing that I forgot to mention is that the network_events and
> host_events tables are partitioned by first octet of the IP, so when I
> say "various scans of ..." that means that there is a scan of each of
> the partitions, the type determined by the planner and the statistics,
> I assume.

So you've got xxx_events tables partitioned by ip, but ip_info is one
table? Do you do a lot of scans across the bottom 3 bytes of the IP? If
not, I'm not clear what we're gaining from the partitioning.

> ** Dual subselects:
> SELECT * FROM ip_info
> WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222)
>       OR ip IN (SELECT ip FROM host_events WHERE port = 2222);
>
> Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual
> time=34..8238 rows=234 loops=1)
> Filter:  ... AND ((hashed_subplan) OR (hashed_subplan))

Well, the estimate here is rubbish - 9.3 million rows whereas we
actually get 234. Now we know you're likely to get a lot of overlap, and
the planner might not realise that. Still - that looks very bad. Of
course, because it's expecting so many rows a seq-scan of ip_info looks
like a good choice to it.

> ** Single subselect:
>
> SELECT * FROM ip_info
> WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222
>             UNION
>             SELECT DISTINCT ip FROM host_events WHERE port = 2222);
>
> Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106
> rows=2349 loops=1)

This is clearly a lot better, Not sure whether the DISTINCT in each
subquery works or not.

> ** The join:
>
> SELECT distinct ip_info.*
>  FROM ip_info RIGHT JOIN network_events USING (ip)
>       RIGHT JOIN  host_events USING (ip)
>  WHERE (network_events.port=2222 OR host_events.port=2222)
>
> Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52
> loops=1)
>  -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61
> rows=63 loops=1)

OK, so what do the plans look like for port=80 or something larger like
that?

Then try adding an index to the various host/network_events tables
CREATE INDEX ... ON ... (ip) WHERE port=80;

--
   Richard Huxton
   Archonet Ltd

В списке pgsql-performance по дате отправления:

Предыдущее
От: Ben
Дата:
Сообщение: drive configuration for a new server
Следующее
От: Kirk Wythers
Дата:
Сообщение: trouble with a join on OS X