Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition

Поиск
Список
Период
Сортировка
От Gavin Flower
Тема Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition
Дата
Msg-id 50A160E2.1050100@archidevsys.co.nz
обсуждение исходный текст
Ответ на Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition  (Rafał Rzepecki <divided.mind@gmail.com>)
Список pgsql-performance
On 12/11/12 22:06, Rafał Rzepecki wrote:
> This indeed works around the issue. Thanks!
>
> On Mon, Nov 12, 2012 at 9:53 AM, ashutosh durugkar <ashucould@gmail.com> wrote:
>> Hey Rafal,
>>
>>
>>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>> run_id IN (SELECT run_id FROM runs WHERE server_id = 515);
>>
>> could you try this:
>>
>>
>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>> run_id = ANY(ARRAY(SELECT run_id FROM runs WHERE server_id = 515));
>>
>> Thanks,
>>
>> On Sun, Nov 11, 2012 at 8:48 AM, Rafał Rzepecki <divided.mind@gmail.com>
>> wrote:
>>> [Please CC me on replies, as I'm not subscribed; thank you.]
>>>
>>> I've ran into a problem with the query planner and IN (subquery)
>>> conditions which I suspect to be a bug. I'll attempt to describe the
>>> relevant details of my database and explain which behaviour I find
>>> unexpected. I've also tried to trigger this behaviour in a clean
>>> database; I think I've succeeded, but the conditions are a bit
>>> different, so perhaps it's a different problem. I'll describe this
>>> setup in detail below.
>>>
>>> I have a somewhat large table (~2.5M rows), stats, which is quite
>>> often (several records a minute) INSERTed to, but never UPDATEd or
>>> DELETEd from. (In case it's relevant, it has an attached AFTER INSERT
>>> trigger which checks time and rebuilds an aggregate materialized view
>>> every hour.) This is the schema:
>>> # \d+ stats
>>>                                                    Table
>>> "serverwatch.stats"
>>>        Column      |            Type             |
>>> Modifiers                      | Storage | Description
>>>
>>>
------------------+-----------------------------+----------------------------------------------------+---------+-------------
>>>   id               | integer                     | not null default
>>> nextval('stats_id_seq'::regclass) | plain   |
>>>   run_id           | integer                     | not null
>>>                                | plain   |
>>>   start_time       | timestamp without time zone | not null
>>>                                | plain   |
>>>   end_time         | timestamp without time zone | not null
>>>                                | plain   |
>>>   cpu_utilization  | double precision            |
>>>                                | plain   |
>>>   disk_read_ops    | bigint                      |
>>>                                | plain   |
>>>   disk_write_ops   | bigint                      |
>>>                                | plain   |
>>>   network_out      | bigint                      |
>>>                                | plain   |
>>>   network_in       | bigint                      |
>>>                                | plain   |
>>>   disk_read_bytes  | bigint                      |
>>>                                | plain   |
>>>   disk_write_bytes | bigint                      |
>>>                                | plain   |
>>> Indexes:
>>>      "stats_pkey" PRIMARY KEY, btree (id)
>>>      "stats_day_index" btree (run_id, day(stats.*))
>>>      "stats_month_index" btree (run_id, month(stats.*))
>>>      "stats_week_index" btree (run_id, week(stats.*))
>>> Foreign-key constraints:
>>>      "stats_runs" FOREIGN KEY (run_id) REFERENCES runs(id)
>>> Triggers:
>>>      stats_day_refresh_trigger AFTER INSERT OR UPDATE ON stats FOR EACH
>>> STATEMENT EXECUTE PROCEDURE mat_view_refresh('serverwatch.stats_day')
>>> Has OIDs: no
>>>
>>> day(), month() and week() functions are just trivial date_trunc on a
>>> relevant field. The referenced table looks like this:
>>> # \d+ runs
>>>                                                   Table "serverwatch.runs"
>>>       Column      |            Type             |
>>> Modifiers                     | Storage | Description
>>>
>>>
-----------------+-----------------------------+---------------------------------------------------+---------+-------------
>>>   id              | integer                     | not null default
>>> nextval('runs_id_seq'::regclass) | plain   |
>>>   server_id       | integer                     | not null
>>>                              | plain   |
>>>   flavor          | flavor                      | not null
>>>                              | plain   |
>>>   region          | region                      | not null
>>>                              | plain   |
>>>   launch_time     | timestamp without time zone | not null
>>>                              | plain   |
>>>   stop_time       | timestamp without time zone |
>>>                              | plain   |
>>>   project_info_id | integer                     | not null
>>>                              | plain   |
>>>   owner_info_id   | integer                     | not null
>>>                              | plain   |
>>> Indexes:
>>>      "runs_pkey" PRIMARY KEY, btree (id)
>>>      "index_runs_on_flavor" btree (flavor)
>>>      "index_runs_on_owner_info_id" btree (owner_info_id)
>>>      "index_runs_on_project_info_id" btree (project_info_id)
>>>      "index_runs_on_region" btree (region)
>>>      "index_runs_on_server_id" btree (server_id)
>>> Foreign-key constraints:
>>>      "runs_owner_info_id_fkey" FOREIGN KEY (owner_info_id) REFERENCES
>>> user_infos(id)
>>>      "runs_project_info_id_fkey" FOREIGN KEY (project_info_id)
>>> REFERENCES project_infos(id)
>>> Referenced by:
>>>      TABLE "stats_day" CONSTRAINT "stats_day_runs" FOREIGN KEY (run_id)
>>> REFERENCES runs(id)
>>>      TABLE "stats" CONSTRAINT "stats_runs" FOREIGN KEY (run_id)
>>> REFERENCES runs(id)
>>> Has OIDs: no
>>>
>>> Now consider this query - note I'm using a subselect here because the
>>> problem originally manifested itself with a view:
>>> SELECT * FROM (SELECT run_id, disk_write_ops FROM stats) AS s WHERE
>>> run_id IN (SELECT id FROM runs WHERE server_id = 515);
>>>
>>> As might be expected, the planner chooses to use one of the three
>>> indices with run_id:
>>> http://explain.depesz.com/s/XU3Q
>>>
>>> Now consider a similar query, but with aggregation:
>>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY
>>> run_id) AS s WHERE run_id IN (SELECT id FROM runs WHERE server_id =
>>> 515);
>>>
>>> Now the picture is very different. The planner, unexplicably,
>>> dismisses the index and opts instead to do a full scan on stats, the
>>> table 2.5 million rows big.
>>> http://explain.depesz.com/s/Rqt
>>>
>>> Note that the problem disappears when we replace the IN condition with
>>> literal:
>>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY
>>> run_id) AS s WHERE run_id IN (1815, 1816);
>>>
>>> The ids are the result of the inner select ran separately, so the
>>> query has the exact same result; it's worth pointing out that the
>>> planner has a correct estimate on the selectivity of the condition -
>>> exactly two rows from runs are selected, as expected. But when literal
>>> is used the planner correctly chooses to use the index:
>>> http://explain.depesz.com/s/lYc
>>>
>>> Similarly a correct plan is chosen when we unnest the inner SELECT:
>>> SELECT run_id, SUM(disk_write_ops) FROM stats WHERE run_id IN (SELECT
>>> id FROM runs WHERE server_id = 515) GROUP BY run_id;
>>> http://explain.depesz.com/s/dlwZ
>>>
>>>
>>> I've tried to replicate this on a clean database:
>>> CREATE TABLE runs(run_id serial PRIMARY KEY, server_id INTEGER NOT NULL);
>>> CREATE INDEX runs_server ON runs(server_id);
>>> CREATE TABLE stats(entry_id serial PRIMARY KEY, run_id integer
>>> REFERENCES runs NOT NULL, utilization INTEGER NOT NULL);
>>> CREATE INDEX stats_runs ON stats(run_id);
>>>
>>> Now let's try some queries:
>>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>>> run_id IN (1212, 2323, 121, 561, 21, 561, 125, 2, 55, 52, 42);
>>> http://explain.depesz.com/s/Kcb - fine, index used
>>>
>>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE
>>> run_id IN (SELECT run_id FROM runs WHERE server_id = 515);
>>> http://explain.depesz.com/s/QFs - seqscan!
>>> Obviously it doesn't mean much, as the tables are empty and there are
>>> no stats, but still a radically different plan is chosen for what is
>>> essentially the same query.
>>>
>>> Note that in this case the behaviour is the same even when unnested:
>>> SELECT run_id, utilization FROM stats WHERE run_id IN (SELECT run_id
>>> FROM runs WHERE server_id = 515);
>>> http://explain.depesz.com/s/y3GM
>>>
>>> So, is this a bug in the planner, or am I somehow subtly changing the
>>> semantics of the query and don't notice?
>>> I understand the planner perhaps tries to parallelize queries when a
>>> SELECT is used in the IN clause, but given the stats it doesn't seem
>>> to make much sense.
>>>
>>> Thanks, and let me know if you want me to test something on my
>>> database over here or if there's some relevant info I've ommited.
>>>
>>> (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc
>>> (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit
>>> running on Ubuntu 12.10, ubuntu package
>>> postgresql-9.1-9.1.6-1ubuntu1:amd64, default configuration)
>>> --
>>> Rafał Rzepecki
>>>
>>>
>>> --
>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>
>
Curious,  would the following be of any use?

SELECT DISTINCT
     r.run_id,
     s.utilization
FROM
     runs AS r JOIN stats AS s USING (run_id)
WHERE
     r.server_id = 515
/**/;/**/


Cheers,
Gavin


В списке pgsql-performance по дате отправления:

Предыдущее
От: Craig Ringer
Дата:
Сообщение: Re: PostreSQL v9.2 uses a lot of memory in Windows XP
Следующее
От: Jon Nelson
Дата:
Сообщение: postgres 8.4, COPY, and high concurrency