Using ctid column changes plan drastically
| От | Thomas Kellerer |
|---|---|
| Тема | Using ctid column changes plan drastically |
| Дата | |
| Msg-id | julsfr$otg$1@dough.gmane.org обсуждение исходный текст |
| Ответы |
Re: Using ctid column changes plan drastically
|
| Список | pgsql-performance |
Hi,
I was testing a query to delete duplicates to see how well using ctid works if the table doesn't have a unique
identifieravailable.
The table definition is:
create table dupes
(
id integer primary key,
first_name text,
last_name text
);
My test table has 100.000 rows with ~13000 being actually unique.
The following statement:
DELETE FROM dupes
WHERE id NOT IN (SELECT min(b.id)
FROM dupes b
GROUP BY first_name, last_Name
HAVING count(*) > 1);
produces a quite nice execution plan:
Delete on public.dupes (cost=2770.00..4640.00 rows=50000 width=6) (actual time=299.809..299.809 rows=0 loops=1)
Buffers: shared hit=88100
-> Seq Scan on public.dupes (cost=2770.00..4640.00 rows=50000 width=6) (actual time=150.113..211.340 rows=86860
loops=1)
Output: dupes.ctid
Filter: (NOT (hashed SubPlan 1))
Buffers: shared hit=1240
SubPlan 1
-> HashAggregate (cost=2620.00..2745.00 rows=10000 width=18) (actual time=115.739..143.004 rows=13140
loops=1)
Output: min(b.id), b.first_name, b.last_name
Filter: (count(*) > 1)
Buffers: shared hit=620
-> Seq Scan on public.dupes b (cost=0.00..1620.00 rows=100000 width=18) (actual time=0.006..15.563
rows=100000loops=1)
Output: b.id, b.first_name, b.last_name
Buffers: shared hit=620
Total runtime: 301.241 ms
Now assuming I do not have a unique value in the table. In that case I would revert to using the ctid to identify
individualrows:
DELETE FROM dupes
WHERE ctid NOT IN (SELECT min(b.ctid)
FROM dupes b
GROUP BY first_name, last_Name
HAVING count(*) > 1);
Which has a completely different execution plan:
Delete on public.dupes (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=269966.623..269966.623 rows=0
loops=1)
Buffers: shared hit=88720
-> Seq Scan on public.dupes (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=176.107..269582.651
rows=86860loops=1)
Output: dupes.ctid
Filter: (NOT (SubPlan 1))
Buffers: shared hit=1240
SubPlan 1
-> Materialize (cost=2620.00..2795.00 rows=10000 width=20) (actual time=0.002..0.799 rows=12277
loops=100000)
Output: (min(b.ctid)), b.first_name, b.last_name
Buffers: shared hit=620
-> HashAggregate (cost=2620.00..2745.00 rows=10000 width=20) (actual time=131.162..164.941
rows=13140loops=1)
Output: min(b.ctid), b.first_name, b.last_name
Filter: (count(*) > 1)
Buffers: shared hit=620
-> Seq Scan on public.dupes b (cost=0.00..1620.00 rows=100000 width=20) (actual
time=0.005..29.531rows=100000 loops=1)
Output: b.ctid, b.first_name, b.last_name
Buffers: shared hit=620
Total runtime: 269968.515 ms
This is Postgres 9.1.4 64bit on Windows 7
Why does the usage of the CTID column change the plan so drastically?
Regards
Thomas
В списке pgsql-performance по дате отправления: