Discussion: Slow query with self-join, group by, 100m rows

Slow query with self-join, group by, 100m rows

From:
Thomas Kappler
Date:
[please CC, I'm not on the list]

Hi all,

we have one table that basically uses Postgres as a key-value store.

     Table "public.termindex"
Column   |  Type   | Modifiers
-------------+---------+-----------
 subject_id | integer |
 indextype  | integer |
 cid        | integer |

This is with Postgres 9.0.

The table has 96 million rows and an index on each column. It contains
no NULLs and has no triggers.

subject_id has about 2m distinct values, cid about 200k, and indextype only six.

The table is *read-only* after the initial load.

The query we want to do is (with example values):

select t.cid, count(distinct t1.subject_id)
from termindex t1, termindex t2
where t1.cid=20642 and t1.indextype=636 and t2.indextype=636 and
t2.subject_id=t1.subject_id
group by t2.cid;

The goal is to select all subjects matching certain criteria, and for
all cids (concept ids) that they have, determine the distribution over
the whole population, for a given indextype.

For instance, if the criteria are "cid=1 and indextype=636", let's say
subjects 1, 2, and 3 match. The indextype we want the distribution for
is 999. So we fetch all cids where subject_id is in (1,2,3) and
indextype=999, and group them by cid with count(subject_id) per cid.
The result would look like

cid  | count
-----+-------
  12 |     1
  13 |    28
...
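
Spelled out in SQL, that second step (with the made-up example values
above) would look roughly like this:

-- Illustrative sketch only: subjects 1, 2, 3 and indextype 999 are the
-- hypothetical example values from the paragraph above.
select cid, count(subject_id)
from termindex
where subject_id in (1, 2, 3)
  and indextype = 999
group by cid;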

Another way of writing this query is with an inner select:

select cid, count(subject_id)
from termindex
where subject_id in
      (select subject_id from termindex where cid=18869 and indextype=636)
  and indextype=636
group by cid;

On this 96m-row table, the join query takes about 50 seconds; EXPLAIN
ANALYZE output is below. The inner-select query takes much longer.

Pasting the EXPLAIN ANALYZE output into http://explain.depesz.com/s/Yr4,
we see that Postgres is doing an external sort using about 150MB of
data.

Now, we're not Postgres experts, or even great at relational design.
Are there better ways of writing that query, or of designing the
table? For the latter we do have a number of constraints that I won't
outline here, because this mail is already long enough.

Thanks in advance!
Thomas


                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=446092576.21..459395622.23 rows=200 width=8) (actual time=18927.047..19001.072 rows=2562 loops=1)
   ->  Sort  (cost=446092576.21..450526924.05 rows=1773739136 width=8) (actual time=18927.025..18952.726 rows=119429 loops=1)
         Sort Key: t.cid
         Sort Method:  external merge  Disk: 2088kB
         ->  Merge Join  (cost=1480064.68..28107420.08 rows=1773739136 width=8) (actual time=14300.547..18836.386 rows=119429 loops=1)
               Merge Cond: (t1.subject_id = t.subject_id)
               ->  Sort  (cost=44173.64..44278.93 rows=42116 width=4) (actual time=30.148..33.965 rows=14466 loops=1)
                     Sort Key: t1.subject_id
                     Sort Method:  external merge  Disk: 200kB
                     ->  Bitmap Heap Scan on mcindex t1  (cost=791.57..40361.19 rows=42116 width=4) (actual time=3.901..18.655 rows=14466 loops=1)
                           Recheck Cond: (cid = 20642)
                           ->  Bitmap Index Scan on mc2  (cost=0.00..781.04 rows=42116 width=0) (actual time=3.319..3.319 rows=14466 loops=1)
                                 Index Cond: (cid = 20642)
               ->  Materialize  (cost=1435891.04..1478006.60 rows=8423113 width=8) (actual time=14270.211..17554.299 rows=8423170 loops=1)
                     ->  Sort  (cost=1435891.04..1456948.82 rows=8423113 width=8) (actual time=14270.202..16346.835 rows=8423094 loops=1)
                           Sort Key: t.subject_id
                           Sort Method:  external merge  Disk: 148232kB
                           ->  Seq Scan on mcindex t  (cost=0.00..121502.13 rows=8423113 width=8) (actual time=0.012..1381.282 rows=8423113 loops=1)
 Total runtime: 22095.280 ms
(19 rows)

Re: Slow query with self-join, group by, 100m rows

From:
Robert Klemme
Date:
On Tue, Sep 20, 2011 at 7:43 PM, Thomas Kappler <tkappler@googlemail.com> wrote:
> [please CC, I'm not on the list]
>
> Hi all,
>
> we have one table that basically uses Postgres as a key-value store.
>
>      Table "public.termindex"
>    Column   |  Type   | Modifiers
> ------------+---------+-----------
>  subject_id | integer |
>  indextype  | integer |
>  cid        | integer |
>
> This is with Postgres 9.0.
>
> The table has 96 million rows and an index on each column. It contains
> no NULLs and has no triggers.
>
> subject_id has about 2m distinct values, cid about 200k, and indextype only six.
>
> The table is *read-only* after the initial load.
>
> The query we want to do is (with example values):
>
> select t.cid, count(distinct t1.subject_id)
> from termindex t1, termindex t2
> where t1.cid=20642 and t1.indextype=636 and t2.indextype=636 and
> t2.subject_id=t1.subject_id
> group by t2.cid;

Do you have any multi-column indexes?  From the text of your query it
looks like it could benefit from these two indexes:

(cid, indextype)
(subject_id, indextype)

I do not know whether PostgreSQL can avoid reading the table if you
extend the first index to

(cid, indextype, subject_id)

in other words, if you append all the columns needed for the join. In
theory the query could then be satisfied from the indexes alone.
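
A minimal sketch of those definitions, using the table and column names
from the original post (the index names here are made up for
illustration):

-- Hypothetical names; whether the planner uses them depends on the data.
create index termindex_cid_type  on termindex (cid, indextype);
create index termindex_subj_type on termindex (subject_id, indextype);

-- Covering variant of the first index, with the join column appended:
create index termindex_cid_type_subj on termindex (cid, indextype, subject_id);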

> Pasting the explain analyze output into
> http://explain.depesz.com/s/Yr4 we see that Postgres is doing an
> external sort using about 150MB of data.
>
> Now, we're not Postgres experts, or even great at relational design.
> Are there better ways of doing that query, or designing the table? For
> the latter we do have a number of constraints, though, that I don't
> want to outline now because this mail is already long enough.

Those are probably important to know.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Re: Slow query with self-join, group by, 100m rows

From:
Tom Lane
Date:
Thomas Kappler <tkappler@googlemail.com> writes:
> The query we want to do is (with example values):

> select t.cid, count(distinct t1.subject_id)
> from termindex t1, termindex t2
> where t1.cid=20642 and t1.indextype=636 and t2.indextype=636 and
> t2.subject_id=t1.subject_id
> group by t2.cid;

The EXPLAIN output you provided doesn't appear to match this query (in
particular, I don't see the indextype restrictions being checked
anyplace in the plan).

One quick-and-dirty thing that might help is to raise work_mem enough
so that (1) you get a hash aggregation not a sort/group one, and (2)
if there are still any sorts being done, they don't spill to disk.
That will probably be a higher number than would be prudent to install
as a global setting, but you could SET it locally in the current
session before issuing the expensive query.
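
A minimal sketch of that session-local approach; the 512MB figure is an
arbitrary example, not a recommendation, and should be sized to the RAM
actually available:

-- Raise work_mem for this session only, not as a global setting.
SET work_mem = '512MB';

-- ... run the expensive query here ...

RESET work_mem;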

            regards, tom lane