Discussion: massive INSERT

massive INSERT

From: Kurt Overberg
Date:
Hi everyone!

I'm trying to maintain a 'membership' table in my postgres database. 
Membership is determined by a potentially costly algorithm that 
basically returns a series of member ids, where I insert those ids into 
my membership table (called groups).  Up to now, for speed purposes, 
I've been deleting all users in a group, then re-building the group from 
scratch.  The table looks roughly like this:

id            | integer | not null default nextval('"xrefmgrp_id_seq"'::text)
membergroupid | integer | not null default 0
memberid      | integer | not null default 0

There's a constraint on the table saying that (membergroupid,memberid) 
needs to be UNIQUE.
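
In DDL terms that's roughly the following (just a sketch; the constraint
name and the separate sequence setup shown here are approximations):

    -- backing sequence for the id column, as implied by the default above
    create sequence xrefmgrp_id_seq;

    create table xrefmembergroup (
        id            integer not null default nextval('xrefmgrp_id_seq'),
        membergroupid integer not null default 0,
        memberid      integer not null default 0,
        constraint xrefmembergroup_grp_member_key unique (membergroupid, memberid)
    );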

...so before re-building a table, I do a:

delete from xrefmembergroup where membergroupid = 4 ;

...then blast the id's into the table:

insert into xrefmembergroup (membergroupid, memberid)
    select 4 as membergroupid, member.id as memberid
    from member
    where <* huge complex select *>
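
(The two steps could also be wrapped in a transaction so the group never
looks empty in between; just a sketch of the same delete-and-reload, with
the placeholder standing in for the real condition:)

    begin;
    delete from xrefmembergroup where membergroupid = 4;
    insert into xrefmembergroup (membergroupid, memberid)
        select 4 as membergroupid, member.id as memberid
        from member
        where <* huge complex select *>;
    commit;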


...I've found this to be faster than running the query, figuring out who
needs to be removed from the group, who needs to be added, and so on.  The
thing that I'm worried about is that this table is going to be pretty big
(potentially millions of rows), and every time I rebuild it I lose (or at
least invalidate) all my indexes.  Is that the case?  Is constantly deleting
and then adding large numbers of rows to a table really bad for performance?
I'm worried this isn't going to scale well.  Does anyone know of a better way
to do this?  Thoughts and comments would be appreciated.

Thanks!

/kurt



Re: massive INSERT

From: Greg Stark
Date:
Kurt Overberg <kurt@hotdogrecords.com> writes:

> ...I've found this to be faster than running the query, figuring out who
> needs to be removed from the group, who needs to be added, and so on.  The
> thing that I'm worried about is that this table is going to be pretty big
> (potentially millions of rows), and every time I rebuild it I lose (or at
> least invalidate) all my indexes.  Is that the case?  Is constantly deleting
> and then adding large numbers of rows to a table really bad for performance?
> I'm worried this isn't going to scale well.  Does anyone know of a better
> way to do this?  Thoughts and comments would be appreciated.

From a programming point of view it's usually cleaner to do it the way you're
doing it than to try to work out the differences and apply only the minimal
changes, so I usually go with that approach.

There are performance implications though. Every time you update or delete a
record in postgres it leaves behind dead tuples that need to get cleaned up.
If you're doing a big batch job they can overflow the space set aside to track
them (the free space map), which means that instead of a plain VACUUM you
might have to do a VACUUM FULL, or else raise the fsm parameters for your
database.
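
If I remember right, a database-wide VACUUM VERBOSE prints a free space map
summary at the end, and the knobs themselves live in postgresql.conf (the
numbers below are only placeholders, not recommendations):

    -- run as superuser; the summary at the end reports free space map usage
    VACUUM VERBOSE;

    -- postgresql.conf (the fsm parameters mentioned above; placeholder values)
    -- max_fsm_pages = 200000
    -- max_fsm_relations = 1000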

It's not clear from your description if you're regenerating the entire table
or just a specific group. If you're looping through all the groups and
regenerating all of them, you should perhaps consider using TRUNCATE instead
of DELETE. TRUNCATE is a bigger hammer and doesn't leave behind any dead
space to clean up.
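
That is, only if every group is being reloaded anyway, something along these
lines (TRUNCATE empties the whole table, not just one group, and takes an
exclusive lock while it runs):

    -- only appropriate when *all* groups are about to be regenerated
    TRUNCATE TABLE xrefmembergroup;

    -- then reload each group in turn
    INSERT INTO xrefmembergroup (membergroupid, memberid)
        SELECT 4, member.id FROM member WHERE <* huge complex select *>;
    -- ...and so on for the other groups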

Alternatively you should consider running VACUUM ANALYZE after every group
update you do, and possibly a VACUUM FULL at an hour when you can withstand
some downtime.
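
Concretely, something along the lines of:

    -- after each group rebuild: reclaim dead tuples and refresh statistics
    VACUUM ANALYZE xrefmembergroup;

    -- during a maintenance window: compact the table
    -- (VACUUM FULL takes an exclusive lock for the duration)
    VACUUM FULL xrefmembergroup;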

I'm not sure what you mean by losing your indexes. They'll be just as valid
afterwards as they were before, though they might become less efficient, for
basically the same reasons as the table itself. You may find that running a
periodic REINDEX during downtime helps.
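
For example:

    -- rebuilds every index on the table; takes exclusive locks,
    -- so best done in the same downtime window
    REINDEX TABLE xrefmembergroup;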

If you're updating the majority of the table and can do it during downtime,
you might consider dropping the index while performing the update and then
recreating it at the end. But that's optional.
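
A sketch of that, using a hypothetical secondary index on memberid (the index
name and definition here are made up; the UNIQUE constraint on
(membergroupid, memberid) would instead have to be dropped and re-added with
ALTER TABLE ... DROP CONSTRAINT / ADD CONSTRAINT):

    -- hypothetical index; adjust the name to whatever actually exists
    DROP INDEX xrefmembergroup_memberid_idx;

    -- ...do the bulk delete and insert here...

    CREATE INDEX xrefmembergroup_memberid_idx ON xrefmembergroup (memberid);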

--
greg