Обсуждение: Disabling/Enabling index before bulk loading

Поиск
Список
Период
Сортировка

Disabling/Enabling index before bulk loading

От
Ravi Krishna
Дата:
We recently did a test on COPY and found that on large tables (47 million rows , 20GB of raw data) the
difference in COPY with 16 indexes and COPY without any index is 1:14. That is, COPY is 14 times slower
when data is ingested with all indexes as opposed to COPY first without index and then create all index.

I googled for earlier posting on this and it looks like this has been asked before too.

This is what I am thinking to do:

1 - Extract index definition and save it as a SQL somewhere, either a file or a table.
2 - Drop all indexes.
3 - Ingest data via COPY
4 - Recreate all indexes saved in (1).

Is there a generic sql or script or tool to accomplish (1).

thanks

Re: Disabling/Enabling index before bulk loading

От
Thomas Kellerer
Дата:
Ravi Krishna schrieb am 10.07.2018 um 16:08:
> 
> We recently did a test on COPY and found that on large tables (47 million rows , 20GB of raw data) the 
> difference in COPY with 16 indexes and COPY without any index is 1:14. That is, COPY is 14 times slower 
> when data is ingested with all indexes as opposed to COPY first without index and then create all index.
> 
> I googled for earlier posting on this and it looks like this has been asked before too.  
> 
> This is what I am thinking to do:
> 
> 1 - Extract index definition and save it as a SQL somewhere, either a file or a table.
> 2 - Drop all indexes.
> 3 - Ingest data via COPY
> 4 - Recreate all indexes saved in (1).
> 
> Is there a generic sql or script or tool to accomplish (1).

You can do this with a script like this:

-- First store the definitions of all the indexes in a table:

create table index_backup 
as
select *
from pg_indexes
where schemaname = 'public'
  and tablename = 'the_table';

-- now drop all the indexes:
do
$$
declare
  l_rec record;
begin
  for l_rec in select schemaname, indexname from index_backup
  loop
    execute format('drop index %I.%I', l_rec.schemaname, l_rec.indexname);
  end loop;
end;
$$

-- now insert the data

...

-- and restore all indexes

do
$$
declare
  l_rec record;
begin
  for l_rec in select indexdef from index_backup
  loop
     execute l_rec.indexdef;
  end loop;
end;
$$



Re: Disabling/Enabling index before bulk loading

От
Adrian Klaver
Дата:
On 07/10/2018 07:08 AM, Ravi Krishna wrote:
> 
> We recently did a test on COPY and found that on large tables (47 million rows , 20GB of raw data) the
> difference in COPY with 16 indexes and COPY without any index is 1:14. That is, COPY is 14 times slower
> when data is ingested with all indexes as opposed to COPY first without index and then create all index.

Did you include the time to CREATE INDEX after the COPY or is the 1:14 
only for the COPY stage?

> 
> I googled for earlier posting on this and it looks like this has been asked before too.
> 
> This is what I am thinking to do:
> 
> 1 - Extract index definition and save it as a SQL somewhere, either a file or a table.
> 2 - Drop all indexes.
> 3 - Ingest data via COPY
> 4 - Recreate all indexes saved in (1).
> 
> Is there a generic sql or script or tool to accomplish (1).
> 
> thanks
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Disabling/Enabling index before bulk loading

От
Ravi Krishna
Дата:
>
> Did you include the time to CREATE INDEX after the COPY or is the 1:14 only for the COPY stage?

Yes.

Time taken to load 47 mil rows with all 16 indexes intact:  14+ hrs
Time taken to load the same after dropping index and then loading and finally creating 16 indexes: 1 hr 40 min

Frankly I am surprised by this staggering difference.




Re: Disabling/Enabling index before bulk loading

От
Vick Khera
Дата:
On Tue, Jul 10, 2018 at 1:13 PM, Ravi Krishna <srkrishna@yahoo.com> wrote:
>
> Did you include the time to CREATE INDEX after the COPY or is the 1:14 only for the COPY stage?

Yes.

Time taken to load 47 mil rows with all 16 indexes intact:  14+ hrs
Time taken to load the same after dropping index and then loading and finally creating 16 indexes: 1 hr 40 min


Re: Disabling/Enabling index before bulk loading

От
Ravi Krishna
Дата:


This does not work in RDS.  In order to update system catalog tables (pg_index), one needs privileges which is
denied in RDS. In RDS terminology, the user must belong to role rdsadmin and that role is reserved only for AWS 
internal users.

Re: Disabling/Enabling index before bulk loading

От
Nikolay Samokhvalov
Дата:
On Tue, Jul 10, 2018 at 12:26 PM Ravi Krishna <srkrishna@yahoo.com> wrote:


This does not work in RDS.  In order to update system catalog tables (pg_index), one needs privileges which is
denied in RDS. In RDS terminology, the user must belong to role rdsadmin and that role is reserved only for AWS 
internal users.

Right, but pl/pgsql works fine there, so you can backup&drop all indexes and then restore them using `do $$ ... $$ language plpgsql;`, as Thomas Kellerer already suggested.

One note here: there is some old confusion between concepts of unique indexes and unique constraints in Postgres (see https://www.postgresql.org/message-id/flat/CANNMO%2BKHkkDg-FBi0_78ADmfLiT9kODmz%2B8m6fR6f5kPL-n_ZQ%40mail.gmail.com#cfb3a9eaed8649d7d24ad7944ccb37cf), so there is a chance that after such backup/restore procedure you'll get logically correct but phisically different state.

Re: Disabling/Enabling index before bulk loading

От
Tim Cross
Дата:
Ravi Krishna <srkrishna@yahoo.com> writes:

> We recently did a test on COPY and found that on large tables (47 million rows , 20GB of raw data) the 
> difference in COPY with 16 indexes and COPY without any index is 1:14. That is, COPY is 14 times slower 
> when data is ingested with all indexes as opposed to COPY first without index and then create all index.
>

This is very system dependent. On our system, when we tested a similar
approach, we found that the time saved through dropping the indexes
before copy was lost when rebuilding the indexes afterwards. In fact, it
ended up being slightly slower.

I suspect a lot depends on the number and types of indexes you
have. Your example had a lot more indexes than ours. We were loading 22
batches with 2.5M records per batch. While copy was significantly faster
than transaction based inserts (minutes vs hours), the differences between
indexes and no indexes was measured in minutes. We only had 3 or 4
indexes.   

> I googled for earlier posting on this and it looks like this has been asked before too.  
>
> This is what I am thinking to do:
>
> 1 - Extract index definition and save it as a SQL somewhere, either a file or a table.
> 2 - Drop all indexes.
> 3 - Ingest data via COPY
> 4 - Recreate all indexes saved in (1).
>
> Is there a generic sql or script or tool to accomplish (1).
>

We are loading data via Javascript using pg and pg-copy-streams modules. It is
pretty straight forward to drop the indexes and recreate them afterwards
via sql, so we didn't look for a tool as such.

As data is only inserted into this table and only by this process, we
also turned off autovacuum for this table, performing vacuum and analyze
manually after load. 

Tim

-- 
Tim Cross


Re: Disabling/Enabling index before bulk loading

От
Melvin Davidson
Дата:


> We recently did a test on COPY and found that on large tables (47 million rows , 20GB of raw data) the
> difference in COPY with 16 indexes...

I am very suspicious of why you need 16 indexes. Are you sure all those indexes are actually being utilized?
Try executing the attached query, You may find find some are really not needed.

--
Melvin Davidson
Maj. Database & Exploration Specialist

Universe Exploration Command – UXC

Employment by invitation only!
Вложения

Re: Disabling/Enabling index before bulk loading

От
Ravi Krishna
Дата:

I am very suspicious of why you need 16 indexes. Are you sure all those indexes are actually being utilized?
Try executing the attached query, You may find find some are really not needed.

This is a DATAMART application and the indexes are to satisfy a large number of queries possible from tableau. Keep in mind
this table has 200+ cols. 
That said, it is possible that a few of them may be unnecessary.  I will run your script and check that out.

thanks for the script.