Re: Duplicate deletion optimizations

From: Samuel Gendler
Subject: Re: Duplicate deletion optimizations
Date:
Msg-id: CAEV0TzDgzktQaqvb+8h_WDibk3V==q+6Syf=vM4J6vtSke7M3g@mail.gmail.com
In reply to: Re: Duplicate deletion optimizations  (Marc Eberhard <eberhardma@googlemail.com>)
Responses: Re: Duplicate deletion optimizations
List: pgsql-performance


On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard <eberhardma@googlemail.com> wrote:
Hi Samuel!

On 6 January 2012 20:02, Samuel Gendler <sgendler@ideasculptor.com> wrote:
> Have you considered doing the insert by doing a bulk insert into a temp
> table and then pulling rows that don't exist across to the final table in
> one query and updating rows that do exist in another query?  I did a very
> brief scan of the SO thread and didn't see it suggested.  Something like
> this:
>
> update stats_5mn set count = count + t.count
> from temp_table t
> where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record
> and stats_5mn.output_id = t.output_id;
>
> insert into stats_5mn
> select * from temp_table t
> where not exists (
> select 1 from stats_5mn s
> where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
> t.output_id
> );
>
> drop table temp_table;

Am I right to assume that the update/insert needs to be placed into a
begin / end transaction block if such batch uploads might happen
concurrently? Doesn't seem to be the case for this question here, but
I like the solution and wonder if it works under more general
circumstances.

Yes, assuming you are concerned about making the insert atomic.  Obviously, a failure in the second query after the first has succeeded would be problematic outside of a transaction, since any attempt to repeat the entire operation would result in repeated updates.
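
For concreteness, a minimal sketch of that transaction using the two queries quoted above (the temp table is assumed to already hold the incoming batch):

begin;

update stats_5mn set count = count + t.count
from temp_table t
where stats_5mn.t_value = t.t_value
  and stats_5mn.t_record = t.t_record
  and stats_5mn.output_id = t.output_id;

insert into stats_5mn
select * from temp_table t
where not exists (
  select 1 from stats_5mn s
  where s.t_value = t.t_value
    and s.t_record = t.t_record
    and s.output_id = t.output_id
);

commit;

If either statement fails, the whole batch rolls back and can safely be retried.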
 
What's the overhead of creating and dropping a temporary table? Is it
only worth doing this for a large number of inserted/updated elements?
What if the number of inserts/updates is only a dozen at a time for a
large table (>10M entries)?

Pretty minimal, but enough that doing a handful of rows at a time probably wouldn't be worth it.  You'd surely get index usage on a plain insert in such a case, so for small numbers of rows I'd probably just use an upsert stored proc - unless you are doing large numbers of inserts overall, just a few at a time.  In that case, I'd try to accumulate them and then write them in bulk.  Those are tough questions to answer without a specific context, so my real answer is 'try it and see' - that way you'll always get an answer that is specific to your exact circumstances.
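
One common shape for such an upsert proc (not from this thread - just a sketch of the retry-loop pattern from the PostgreSQL docs, with assumed bigint column types and an assumed unique constraint on (t_value, t_record, output_id)):

create or replace function upsert_stats_5mn(
    p_t_value   bigint,   -- column types are assumptions; adjust to the real schema
    p_t_record  bigint,
    p_output_id bigint,
    p_count     bigint
) returns void as $$
begin
    loop
        -- try the update first
        update stats_5mn
           set count = count + p_count
         where t_value = p_t_value
           and t_record = p_t_record
           and output_id = p_output_id;
        if found then
            return;
        end if;
        -- row not there, so try to insert; another session may beat us to it,
        -- in which case we catch the unique violation and retry the update
        begin
            insert into stats_5mn (t_value, t_record, output_id, count)
            values (p_t_value, p_t_record, p_output_id, p_count);
            return;
        exception when unique_violation then
            -- fall through and loop back to the update
        end;
    end loop;
end;
$$ language plpgsql;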

By the way, there is definitely a difference between creating a temp table and creating a table temporarily.  See the postgres docs about temp tables for specifics, but many databases treat temp tables differently from ordinary tables, so it is worth understanding what those differences are.  Temp tables are automatically dropped when a connection (or transaction) is closed.  Temp table names are local to the connection, so multiple connections can each create a temp table with the same name without conflict, which is convenient. I believe they are also created in a specific tablespace on disk, etc.
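
A rough sketch of using a temp table that way (defining it via LIKE is just one option; adjust to the real schema):

begin;

-- Visible only to this connection; dropped automatically at commit,
-- so the explicit "drop table temp_table" is no longer needed.
create temporary table temp_table (like stats_5mn) on commit drop;

-- load the batch here (insert ... / copy ...), then run the
-- update and insert shown above before committing.

commit;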
