Re: Duplicate deletion optimizations
От | antoine@inaps.org |
---|---|
Тема | Re: Duplicate deletion optimizations |
Дата | |
Msg-id | 3907c59005bc154cf0b8ebb918303d51@inaps.org обсуждение исходный текст |
Ответ на | Duplicate deletion optimizations (antoine@inaps.org) |
Ответы |
Re: Duplicate deletion optimizations
(Jochen Erwied <jochen@pgsql-performance.erwied.eu>)
|
Список | pgsql-performance |
On Fri, 06 Jan 2012 15:35:36 +0100, antoine@inaps.org wrote: > Hello, > > I've a table with approximately 50 million rows with a schema like > this: > > id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), > t_value integer NOT NULL DEFAULT 0, > t_record integer NOT NULL DEFAULT 0, > output_id integer NOT NULL DEFAULT 0, > count bigint NOT NULL DEFAULT 0, > CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) > > Every 5 minutes, a process have to insert a few thousand of rows in > this table, > but sometime, the process have to insert an already existing row > (based on > values in the triplet (t_value, t_record, output_id). In this case, > the row > must be updated with the new count value. I've tried some solution > given on this > stackoverflow question [1] but the insertion rate is always too low > for my needs. > > So, I've decided to do it in two times: > > - I insert all my new data with a COPY command > - When it's done, I run a delete query to remove oldest duplicates > > Right now, my delete query look like this: > > SELECT min(id) FROM stats_5mn > GROUP BY t_value, t_record, output_id > HAVING count(*) > 1; Correction: DELETE FROM stats_5mn WHERE id in ( SELECT min(id) FROM stats_5mn GROUP BY t_value, t_record, output_id HAVING count(*) > 1; ); Sorry :-) > > The duration of the query on my test machine with approx. 16 million > rows is ~18s. > > To reduce this duration, I've tried to add an index on my triplet: > > CREATE INDEX test > ON stats_5mn > USING btree > (t_value , t_record , output_id ); > > By default, the PostgreSQL planner doesn't want to use my index and > do a sequential > scan [2], but if I force it with "SET enable_seqscan = off", the > index is used [3] > and query duration is lowered to ~5s. > > > My questions: > > - Why the planner refuse to use my index? > - Is there a better method for my problem? > > > Thanks by advance for your help, > Antoine Millet. > > > [1] > > http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql > > > http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different > > [2] http://explain.depesz.com/s/UzW : > GroupAggregate (cost=1167282.380..1294947.770 rows=762182 > width=20) (actual time=20067.661..20067.661 rows=0 loops=1) > Filter: (five(*) > 1) > -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) > (actual time=15663.549..17463.458 rows=7621805 loops=1) > Sort Key: delta, kilo, four > Sort Method: external merge Disk: 223512kB > -> Seq Scan on three (cost=0.000..139734.140 > rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 > loops=1) > > [3] http://explain.depesz.com/s/o9P : > GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) > (actual time=5307.734..5307.734 rows=0 loops=1) > Filter: (five(*) > 1) > -> Index Scan using charlie on three > (cost=0.000..11422738.330 rows=7621814 width=20) (actual > time=0.046..2062.952 rows=7621805 loops=1)
В списке pgsql-performance по дате отправления: