Thread: Alternative to drop index, load data, recreate index?


Alternative to drop index, load data, recreate index?

From:
"Jason L. Buberel"
Date:
When loading very large data exports (> 1 million records) I have found it necessary to use the following sequence to achieve even reasonable import performance:

1. Drop all indices on the recipient table
2. Use "copy recipient_table from '/tmp/input.file';"
3. Recreate all indices on the recipient table
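
For concreteness, a minimal SQL sketch of that sequence (the index and
column names here are hypothetical):

    -- 1. Drop the indexes so COPY does not have to maintain them row by row
    DROP INDEX recipient_table_created_at_idx;
    DROP INDEX recipient_table_email_idx;

    -- 2. Bulk-load; COPY avoids per-row INSERT overhead
    COPY recipient_table FROM '/tmp/input.file';

    -- 3. Rebuild the indexes, one pass over the table each
    CREATE INDEX recipient_table_created_at_idx ON recipient_table (created_at);
    CREATE INDEX recipient_table_email_idx ON recipient_table (email);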

However, I now have tables so large that even the 'recreate all indices' step is taking too long (15-20 minutes on 8.2.4).

I am considering moving to date-based partitioned tables (each table = one month-year of data, for example). Before I go that far - are there any other tricks I can or should be using to speed up my bulk data loading?

Thanks,
jason

Re: Alternative to drop index, load data, recreate index?

From:
Jeff Davis
Date:
On Mon, 2007-09-10 at 17:06 -0700, Jason L. Buberel wrote:
> When loading very large data exports (> 1 million records) I have
> found it necessary to use the following sequence to achieve even
> reasonable import performance:
>
> 1. Drop all indices on the recipient table
> 2. Use "copy recipient_table from '/tmp/input.file';"
> 3. Recreate all indices on the recipient table
>
> However, I now have tables so large that even the 'recreate all
> indices' step is taking too long (15-20 minutes on 8.2.4).
>
> I am considering moving to date-based partitioned tables (each table =
> one month-year of data, for example). Before I go that far - are there
> any other tricks I can or should be using to speed up my bulk data
> loading?

If you create the indexes with CONCURRENTLY, then you can write to the
tables while the indexes are being created. That might help reduce your
downtime window.
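
A minimal sketch (the index name and column are hypothetical;
CONCURRENTLY is available from 8.2 on):

    -- Builds the index without taking a lock that blocks writes, so the
    -- table stays fully usable during the build. It is slower than a
    -- plain CREATE INDEX and cannot run inside a transaction block.
    CREATE INDEX CONCURRENTLY recipient_table_created_at_idx
        ON recipient_table (created_at);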

Regards,
    Jeff Davis


Re: Alternative to drop index, load data, recreate index?

From:
"Scott Marlowe"
Date:
On 9/10/07, Jason L. Buberel <jason@buberel.org> wrote:
>
>  When loading very large data exports (> 1 million records) I have found it
> necessary to use the following sequence to achieve even reasonable import
> performance:
>
>  1. Drop all indices on the recipient table
>  2. Use "copy recipient_table from '/tmp/input.file';"
>  3. Recreate all indices on the recipient table
>
>  However, I now have tables so large that even the 'recreate all indices'
> step is taking too long (15-20 minutes on 8.2.4).

Well, that's pretty much the fastest way to import data.
If you've got foreign keys you can look into disabling them while
importing, if they're slowing things down.
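
On 8.2 "disabling" a foreign key effectively means dropping the
constraint and re-adding it after the load; a sketch, with hypothetical
constraint, column, and table names:

    -- Drop the FK so each loaded row skips the reference check
    ALTER TABLE recipient_table DROP CONSTRAINT recipient_table_list_id_fkey;

    COPY recipient_table FROM '/tmp/input.file';

    -- Re-add it; the whole table is re-validated here in a single pass
    ALTER TABLE recipient_table
        ADD CONSTRAINT recipient_table_list_id_fkey
        FOREIGN KEY (list_id) REFERENCES mailing_list (id);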

I've got indexes that individually take 20 to 30 minutes to create on
one large reporting table.  All 8 or so indexes take well over 2 hours
to create.  But I don't load it very often.

Note that with 8.2 you can create your new indexes in a non-blocking
mode so that the table is accessible while the indexes are being
created.

>  I am considering moving to date-based partitioned tables (each table = one
> month-year of data, for example). Before I go that far - are there any other
> tricks I can or should be using to speed up my bulk data loading?

Partitioning would help.  I've used it for tables where I've had to do
stuff like that, and it definitely helps.
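
On 8.2 that means inheritance-based partitioning with CHECK constraints
and constraint_exclusion; a rough sketch with hypothetical names:

    -- Parent table; it holds no rows itself
    CREATE TABLE recipient_table (
        recipient_id bigint,
        created_at   date,
        email        text
    );

    -- One child table per month; the CHECK constraint lets the planner
    -- skip this partition for queries outside its date range
    CREATE TABLE recipient_table_2007_09 (
        CHECK (created_at >= DATE '2007-09-01'
           AND created_at <  DATE '2007-10-01')
    ) INHERITS (recipient_table);

    -- Load and index one month at a time; rebuilding a single month's
    -- index is far cheaper than one spanning the whole history
    COPY recipient_table_2007_09 FROM '/tmp/input.file';
    CREATE INDEX recipient_table_2007_09_created_at_idx
        ON recipient_table_2007_09 (created_at);

    -- Required for the planner to prune partitions on 8.2
    SET constraint_exclusion = on;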

Re: Alternative to drop index, load data, recreate index?

From:
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2007-09-10 at 17:06 -0700, Jason L. Buberel wrote:
>> However, I now have tables so large that even the 'recreate all
>> indices' step is taking too long (15-20 minutes on 8.2.4).

> If you create the indexes with CONCURRENTLY, then you can write to the
> tables while the indexes are being created. That might help reduce your
> downtime window.

Also, I trust you've experimented to find the optimal
maintenance_work_mem for this task on your machine?  The optimal value
is probably higher on 8.2.x than it was before, since we improved the
sort code...
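
For instance, set it per session before rebuilding (the 1GB figure is
only illustrative; size it to the machine's available RAM):

    -- Applies to this session only; gives CREATE INDEX a larger sort area
    SET maintenance_work_mem = '1GB';
    CREATE INDEX recipient_table_email_idx ON recipient_table (email);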

            regards, tom lane

Re: Alternative to drop index, load data, recreate index?

From:
hubert depesz lubaczewski
Date:
On Mon, Sep 10, 2007 at 05:06:35PM -0700, Jason L. Buberel wrote:
> I am considering moving to date-based partitioned tables (each table =
> one month-year of data, for example). Before I go that far - are there
> any other tricks I can or should be using to speed up my bulk data loading?

Did you try pg_bulkload? (http://pgbulkload.projects.postgresql.org/)

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - a blog for you (and my CV)

Re: Alternative to drop index, load data, recreate index?

From:
"Jason L. Buberel"
Date:
Depesz,

Thank you for the suggestion - I thought I had read up on that tool earlier but had somehow managed to forget about it when starting this phase of my investigation.

Needless to say, I can confirm the claims made on the project homepage when using very large data sets.

- Loading 1.2M records into an indexed table:
  - pg_bulkload: 5m 29s
  - COPY FROM: 53m 20s

These results were obtained using pg-8.2.4 with pg_bulkload-2.2.0.

-jason

hubert depesz lubaczewski wrote:
> On Mon, Sep 10, 2007 at 05:06:35PM -0700, Jason L. Buberel wrote:
>> I am considering moving to date-based partitioned tables (each table =
>> one month-year of data, for example). Before I go that far - are there
>> any other tricks I can or should be using to speed up my bulk data loading?
>
> Did you try pg_bulkload? (http://pgbulkload.projects.postgresql.org/)
>
> depesz