Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition
От | Gavin Flower |
---|---|
Тема | Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition |
Дата | |
Msg-id | 50A160E2.1050100@archidevsys.co.nz обсуждение исходный текст |
Ответ на | Re: Planner sometimes doesn't use a relevant index with IN (subquery) condition (Rafał Rzepecki <divided.mind@gmail.com>) |
Список | pgsql-performance |
On 12/11/12 22:06, Rafał Rzepecki wrote: > This indeed works around the issue. Thanks! > > On Mon, Nov 12, 2012 at 9:53 AM, ashutosh durugkar <ashucould@gmail.com> wrote: >> Hey Rafal, >> >> >>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE >> run_id IN (SELECT run_id FROM runs WHERE server_id = 515); >> >> could you try this: >> >> >> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE >> run_id = ANY(ARRAY(SELECT run_id FROM runs WHERE server_id = 515)); >> >> Thanks, >> >> On Sun, Nov 11, 2012 at 8:48 AM, Rafał Rzepecki <divided.mind@gmail.com> >> wrote: >>> [Please CC me on replies, as I'm not subscribed; thank you.] >>> >>> I've ran into a problem with the query planner and IN (subquery) >>> conditions which I suspect to be a bug. I'll attempt to describe the >>> relevant details of my database and explain which behaviour I find >>> unexpected. I've also tried to trigger this behaviour in a clean >>> database; I think I've succeeded, but the conditions are a bit >>> different, so perhaps it's a different problem. I'll describe this >>> setup in detail below. >>> >>> I have a somewhat large table (~2.5M rows), stats, which is quite >>> often (several records a minute) INSERTed to, but never UPDATEd or >>> DELETEd from. (In case it's relevant, it has an attached AFTER INSERT >>> trigger which checks time and rebuilds an aggregate materialized view >>> every hour.) This is the schema: >>> # \d+ stats >>> Table >>> "serverwatch.stats" >>> Column | Type | >>> Modifiers | Storage | Description >>> >>> ------------------+-----------------------------+----------------------------------------------------+---------+------------- >>> id | integer | not null default >>> nextval('stats_id_seq'::regclass) | plain | >>> run_id | integer | not null >>> | plain | >>> start_time | timestamp without time zone | not null >>> | plain | >>> end_time | timestamp without time zone | not null >>> | plain | >>> cpu_utilization | double precision | >>> | plain | >>> disk_read_ops | bigint | >>> | plain | >>> disk_write_ops | bigint | >>> | plain | >>> network_out | bigint | >>> | plain | >>> network_in | bigint | >>> | plain | >>> disk_read_bytes | bigint | >>> | plain | >>> disk_write_bytes | bigint | >>> | plain | >>> Indexes: >>> "stats_pkey" PRIMARY KEY, btree (id) >>> "stats_day_index" btree (run_id, day(stats.*)) >>> "stats_month_index" btree (run_id, month(stats.*)) >>> "stats_week_index" btree (run_id, week(stats.*)) >>> Foreign-key constraints: >>> "stats_runs" FOREIGN KEY (run_id) REFERENCES runs(id) >>> Triggers: >>> stats_day_refresh_trigger AFTER INSERT OR UPDATE ON stats FOR EACH >>> STATEMENT EXECUTE PROCEDURE mat_view_refresh('serverwatch.stats_day') >>> Has OIDs: no >>> >>> day(), month() and week() functions are just trivial date_trunc on a >>> relevant field. The referenced table looks like this: >>> # \d+ runs >>> Table "serverwatch.runs" >>> Column | Type | >>> Modifiers | Storage | Description >>> >>> -----------------+-----------------------------+---------------------------------------------------+---------+------------- >>> id | integer | not null default >>> nextval('runs_id_seq'::regclass) | plain | >>> server_id | integer | not null >>> | plain | >>> flavor | flavor | not null >>> | plain | >>> region | region | not null >>> | plain | >>> launch_time | timestamp without time zone | not null >>> | plain | >>> stop_time | timestamp without time zone | >>> | plain | >>> project_info_id | integer | not null >>> | plain | >>> owner_info_id | integer | not null >>> | plain | >>> Indexes: >>> "runs_pkey" PRIMARY KEY, btree (id) >>> "index_runs_on_flavor" btree (flavor) >>> "index_runs_on_owner_info_id" btree (owner_info_id) >>> "index_runs_on_project_info_id" btree (project_info_id) >>> "index_runs_on_region" btree (region) >>> "index_runs_on_server_id" btree (server_id) >>> Foreign-key constraints: >>> "runs_owner_info_id_fkey" FOREIGN KEY (owner_info_id) REFERENCES >>> user_infos(id) >>> "runs_project_info_id_fkey" FOREIGN KEY (project_info_id) >>> REFERENCES project_infos(id) >>> Referenced by: >>> TABLE "stats_day" CONSTRAINT "stats_day_runs" FOREIGN KEY (run_id) >>> REFERENCES runs(id) >>> TABLE "stats" CONSTRAINT "stats_runs" FOREIGN KEY (run_id) >>> REFERENCES runs(id) >>> Has OIDs: no >>> >>> Now consider this query - note I'm using a subselect here because the >>> problem originally manifested itself with a view: >>> SELECT * FROM (SELECT run_id, disk_write_ops FROM stats) AS s WHERE >>> run_id IN (SELECT id FROM runs WHERE server_id = 515); >>> >>> As might be expected, the planner chooses to use one of the three >>> indices with run_id: >>> http://explain.depesz.com/s/XU3Q >>> >>> Now consider a similar query, but with aggregation: >>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY >>> run_id) AS s WHERE run_id IN (SELECT id FROM runs WHERE server_id = >>> 515); >>> >>> Now the picture is very different. The planner, unexplicably, >>> dismisses the index and opts instead to do a full scan on stats, the >>> table 2.5 million rows big. >>> http://explain.depesz.com/s/Rqt >>> >>> Note that the problem disappears when we replace the IN condition with >>> literal: >>> SELECT * FROM (SELECT run_id, SUM(disk_write_ops) FROM stats GROUP BY >>> run_id) AS s WHERE run_id IN (1815, 1816); >>> >>> The ids are the result of the inner select ran separately, so the >>> query has the exact same result; it's worth pointing out that the >>> planner has a correct estimate on the selectivity of the condition - >>> exactly two rows from runs are selected, as expected. But when literal >>> is used the planner correctly chooses to use the index: >>> http://explain.depesz.com/s/lYc >>> >>> Similarly a correct plan is chosen when we unnest the inner SELECT: >>> SELECT run_id, SUM(disk_write_ops) FROM stats WHERE run_id IN (SELECT >>> id FROM runs WHERE server_id = 515) GROUP BY run_id; >>> http://explain.depesz.com/s/dlwZ >>> >>> >>> I've tried to replicate this on a clean database: >>> CREATE TABLE runs(run_id serial PRIMARY KEY, server_id INTEGER NOT NULL); >>> CREATE INDEX runs_server ON runs(server_id); >>> CREATE TABLE stats(entry_id serial PRIMARY KEY, run_id integer >>> REFERENCES runs NOT NULL, utilization INTEGER NOT NULL); >>> CREATE INDEX stats_runs ON stats(run_id); >>> >>> Now let's try some queries: >>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE >>> run_id IN (1212, 2323, 121, 561, 21, 561, 125, 2, 55, 52, 42); >>> http://explain.depesz.com/s/Kcb - fine, index used >>> >>> SELECT * FROM (SELECT run_id, utilization FROM stats) AS s WHERE >>> run_id IN (SELECT run_id FROM runs WHERE server_id = 515); >>> http://explain.depesz.com/s/QFs - seqscan! >>> Obviously it doesn't mean much, as the tables are empty and there are >>> no stats, but still a radically different plan is chosen for what is >>> essentially the same query. >>> >>> Note that in this case the behaviour is the same even when unnested: >>> SELECT run_id, utilization FROM stats WHERE run_id IN (SELECT run_id >>> FROM runs WHERE server_id = 515); >>> http://explain.depesz.com/s/y3GM >>> >>> So, is this a bug in the planner, or am I somehow subtly changing the >>> semantics of the query and don't notice? >>> I understand the planner perhaps tries to parallelize queries when a >>> SELECT is used in the IN clause, but given the stats it doesn't seem >>> to make much sense. >>> >>> Thanks, and let me know if you want me to test something on my >>> database over here or if there's some relevant info I've ommited. >>> >>> (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc >>> (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit >>> running on Ubuntu 12.10, ubuntu package >>> postgresql-9.1-9.1.6-1ubuntu1:amd64, default configuration) >>> -- >>> Rafał Rzepecki >>> >>> >>> -- >>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >>> To make changes to your subscription: >>> http://www.postgresql.org/mailpref/pgsql-performance >> > > Curious, would the following be of any use? SELECT DISTINCT r.run_id, s.utilization FROM runs AS r JOIN stats AS s USING (run_id) WHERE r.server_id = 515 /**/;/**/ Cheers, Gavin
В списке pgsql-performance по дате отправления: