Обсуждение: Delete tables difference involves seq scan

Поиск
Список
Период
Сортировка

Delete tables difference involves seq scan

От
Danylo Hlynskyi
Дата:
Hello. I want to remove rows from first table, that exist in second (equality is done using PK). However I experience seq scan on second table, which counters my intuition - I think it should be index-only. Because tables are large, performance of query is very bad. However I got mixed results when trying to reproduce this behavior on syntetic tables. Here I'll show 3 different plans, which I got for the same query. 1. Setup is: --------------------------- create table diff (id uuid constraint diff_pkey primary key); create table origin (id uuid constraint origin_pkey primary key); --------------------------- The query generates correct plan, which performs only index scans: explain delete from origin where exists (select id from diff where origin.id = diff.id); QUERY PLAN ------------------------------------------------------------------------------------------- Delete on origin (cost=0.30..105.56 rows=1850 width=12) -> Merge Semi Join (cost=0.30..105.56 rows=1850 width=12) Merge Cond: (origin.id = diff.id) -> Index Scan using origin_pkey on origin (cost=0.15..38.90 rows=1850 width=22) -> Index Scan using diff_pkey on diff (cost=0.15..38.90 rows=1850 width=22) (5 rows) 2. Setup is: -------------------------------- create table origin (id uuid constraint origin_pkey primary key, data jsonb); create table diff (id uuid constraint diff_pkey primary key, data jsonb); -------------------------------- The query generates plan with two seq scans: explain delete from origin where exists (select id from diff where origin.id = diff.id); QUERY PLAN --------------------------------------------------------------------------- Delete on origin (cost=34.08..69.49 rows=1070 width=12) -> Hash Semi Join (cost=34.08..69.49 rows=1070 width=12) Hash Cond: (origin.id = diff.id) -> Seq Scan on origin (cost=0.00..20.70 rows=1070 width=22) -> Hash (cost=20.70..20.70 rows=1070 width=22) -> Seq Scan on diff (cost=0.00..20.70 rows=1070 width=22) (6 rows) 3. My real `origin` table has 26 fields and 800 billion rows, real `diff` table has 12 million rows and the query generates plan with nested loop and seq scan on `diff` table: explain delete from drug_refills origin where exists (select id from drug_refills_diff diff where origin.id = diff.id); QUERY PLAN ---------------------------------------------------------------------------------------------------------- Delete on drug_refills origin (cost=0.57..22049570.11 rows=11975161 width=12) -> Nested Loop (cost=0.57..22049570.11 rows=11975161 width=12) -> Seq Scan on drug_refills_diff diff (cost=0.00..720405.61 rows=11975161 width=22) -> Index Scan using drug_refills_pkey on drug_refills origin (cost=0.57..1.77 rows=1 width=22) Index Cond: (id = diff.id) (5 rows) I have run ANALYZE on both tables, but it didn't help. Here are column types in origin and diff (same schema), if that matters: uuid timestamp with time zone timestamp with time zone character varying(255) character varying(255) character varying(1024) numeric(10,4) integer numeric(14,8) numeric(14,8) numeric(14,8) numeric(14,8) numeric(14,8) character varying(16) character varying(16) character varying(16) character varying(16) character varying(16) character varying(16) date jsonb text[] uuid uuid uuid uuid

Re: Delete tables difference involves seq scan

От
Danylo Hlynskyi
Дата:
Oh, sorry, this happens on Postgresql 9.6.6. I've checked that on Postgresql 10.0 query plan from setup (1) now uses two seq scans, like in setup (2). 2017-12-01 15:03 GMT+02:00 Danylo Hlynskyi : > Hello. I want to remove rows from first table, that exist in second > (equality is done using PK). However I experience seq scan on second table, > which counters my intuition - I think it should be index-only. Because > tables are large, performance of query is very bad. > > However I got mixed results when trying to reproduce this behavior on > syntetic tables. Here I'll show 3 different plans, which I got for the same > query. > > 1. Setup is: > --------------------------- > create table diff (id uuid constraint diff_pkey primary key); > create table origin (id uuid constraint origin_pkey primary key); > --------------------------- > > The query generates correct plan, which performs only index scans: > > explain delete from origin where exists (select id from diff where > origin.id = diff.id); > QUERY PLAN > ------------------------------------------------------------ > ------------------------------- > Delete on origin (cost=0.30..105.56 rows=1850 width=12) > -> Merge Semi Join (cost=0.30..105.56 rows=1850 width=12) > Merge Cond: (origin.id = diff.id) > -> Index Scan using origin_pkey on origin (cost=0.15..38.90 > rows=1850 width=22) > -> Index Scan using diff_pkey on diff (cost=0.15..38.90 > rows=1850 width=22) > (5 rows) > > 2. Setup is: > -------------------------------- > create table origin (id uuid constraint origin_pkey primary key, data > jsonb); > create table diff (id uuid constraint diff_pkey primary key, data jsonb); > -------------------------------- > > The query generates plan with two seq scans: > > explain delete from origin where exists (select id from diff where > origin.id = diff.id); > QUERY PLAN > ------------------------------------------------------------ > --------------- > Delete on origin (cost=34.08..69.49 rows=1070 width=12) > -> Hash Semi Join (cost=34.08..69.49 rows=1070 width=12) > Hash Cond: (origin.id = diff.id) > -> Seq Scan on origin (cost=0.00..20.70 rows=1070 width=22) > -> Hash (cost=20.70..20.70 rows=1070 width=22) > -> Seq Scan on diff (cost=0.00..20.70 rows=1070 width=22) > (6 rows) > > 3. My real `origin` table has 26 fields and 800 billion rows, real `diff` > table has 12 million rows and the query generates plan with nested loop and > seq scan on `diff` table: > > explain delete from drug_refills origin where exists (select id from > drug_refills_diff diff where origin.id = diff.id); > QUERY PLAN > ------------------------------------------------------------ > ---------------------------------------------- > Delete on drug_refills origin (cost=0.57..22049570.11 rows=11975161 > width=12) > -> Nested Loop (cost=0.57..22049570.11 rows=11975161 width=12) > -> Seq Scan on drug_refills_diff diff (cost=0.00..720405.61 > rows=11975161 width=22) > -> Index Scan using drug_refills_pkey on drug_refills origin > (cost=0.57..1.77 rows=1 width=22) > Index Cond: (id = diff.id) > (5 rows) > > I have run ANALYZE on both tables, but it didn't help. Here are column > types in origin and diff (same schema), if that matters: > > uuid > timestamp with time zone > timestamp with time zone > character varying(255) > character varying(255) > character varying(1024) > numeric(10,4) > integer > numeric(14,8) > numeric(14,8) > numeric(14,8) > numeric(14,8) > numeric(14,8) > character varying(16) > character varying(16) > character varying(16) > character varying(16) > character varying(16) > character varying(16) > date > jsonb > text[] > uuid > uuid > uuid > uuid > >

Re: Delete tables difference involves seq scan

От
Danylo Hlynskyi
Дата:
I was able to speedup original query a lot by using CTE. It still uses seq scan on `diff` table, but looks like it does this once: explain with diff as (select id from drug_refills_diff) delete from drug_refills where id in (select id from diff); QUERY PLAN --------------------------------------------------------------------------------------------------- Delete on drug_refills (cost=989844.94..990366.86 rows=456888836 width=46) CTE diff -> Seq Scan on drug_refills_diff (cost=0.00..720404.88 rows=11975088 width=16) -> Nested Loop (cost=269440.05..269961.98 rows=456888836 width=46) -> HashAggregate (cost=269439.48..269441.48 rows=200 width=56) Group Key: diff.id -> CTE Scan on diff (cost=0.00..239501.76 rows=11975088 width=56) -> Index Scan using drug_refills_pkey on drug_refills (cost=0.57..2.59 rows=1 width=22) Index Cond: (id = diff.id) (9 rows) 2017-12-01 15:17 GMT+02:00 Danylo Hlynskyi : > Oh, sorry, this happens on Postgresql 9.6.6. I've checked that on > Postgresql 10.0 query plan from setup (1) > now uses two seq scans, like in setup (2). > > > 2017-12-01 15:03 GMT+02:00 Danylo Hlynskyi : > >> Hello. I want to remove rows from first table, that exist in second >> (equality is done using PK). However I experience seq scan on second table, >> which counters my intuition - I think it should be index-only. Because >> tables are large, performance of query is very bad. >> >> However I got mixed results when trying to reproduce this behavior on >> syntetic tables. Here I'll show 3 different plans, which I got for the same >> query. >> >> 1. Setup is: >> --------------------------- >> create table diff (id uuid constraint diff_pkey primary key); >> create table origin (id uuid constraint origin_pkey primary key); >> --------------------------- >> >> The query generates correct plan, which performs only index scans: >> >> explain delete from origin where exists (select id from diff where >> origin.id = diff.id); >> QUERY PLAN >> ------------------------------------------------------------ >> ------------------------------- >> Delete on origin (cost=0.30..105.56 rows=1850 width=12) >> -> Merge Semi Join (cost=0.30..105.56 rows=1850 width=12) >> Merge Cond: (origin.id = diff.id) >> -> Index Scan using origin_pkey on origin (cost=0.15..38.90 >> rows=1850 width=22) >> -> Index Scan using diff_pkey on diff (cost=0.15..38.90 >> rows=1850 width=22) >> (5 rows) >> >> 2. Setup is: >> -------------------------------- >> create table origin (id uuid constraint origin_pkey primary key, data >> jsonb); >> create table diff (id uuid constraint diff_pkey primary key, data jsonb); >> -------------------------------- >> >> The query generates plan with two seq scans: >> >> explain delete from origin where exists (select id from diff where >> origin.id = diff.id); >> QUERY PLAN >> ------------------------------------------------------------ >> --------------- >> Delete on origin (cost=34.08..69.49 rows=1070 width=12) >> -> Hash Semi Join (cost=34.08..69.49 rows=1070 width=12) >> Hash Cond: (origin.id = diff.id) >> -> Seq Scan on origin (cost=0.00..20.70 rows=1070 width=22) >> -> Hash (cost=20.70..20.70 rows=1070 width=22) >> -> Seq Scan on diff (cost=0.00..20.70 rows=1070 width=22) >> (6 rows) >> >> 3. My real `origin` table has 26 fields and 800 billion rows, real `diff` >> table has 12 million rows and the query generates plan with nested loop and >> seq scan on `diff` table: >> >> explain delete from drug_refills origin where exists (select id from >> drug_refills_diff diff where origin.id = diff.id); >> QUERY PLAN >> ------------------------------------------------------------ >> ---------------------------------------------- >> Delete on drug_refills origin (cost=0.57..22049570.11 rows=11975161 >> width=12) >> -> Nested Loop (cost=0.57..22049570.11 rows=11975161 width=12) >> -> Seq Scan on drug_refills_diff diff (cost=0.00..720405.61 >> rows=11975161 width=22) >> -> Index Scan using drug_refills_pkey on drug_refills origin >> (cost=0.57..1.77 rows=1 width=22) >> Index Cond: (id = diff.id) >> (5 rows) >> >> I have run ANALYZE on both tables, but it didn't help. Here are column >> types in origin and diff (same schema), if that matters: >> >> uuid >> timestamp with time zone >> timestamp with time zone >> character varying(255) >> character varying(255) >> character varying(1024) >> numeric(10,4) >> integer >> numeric(14,8) >> numeric(14,8) >> numeric(14,8) >> numeric(14,8) >> numeric(14,8) >> character varying(16) >> character varying(16) >> character varying(16) >> character varying(16) >> character varying(16) >> character varying(16) >> date >> jsonb >> text[] >> uuid >> uuid >> uuid >> uuid >> >> >