Discussion: INSERT extremely slow with large data sets


INSERT extremely slow with large data sets

From:
Slavisa Garic
Date:
Hi Everyone,

This is my first post here, so please tell me to go somewhere else if this
is the wrong place to post questions like this.

I am using PostgreSQL 7.3.2 and have used earlier versions (7.1.x onwards),
and with all of them I noticed the same problem with INSERTs when there is
a large data set. Just so you can compare, here is the time it takes to
insert one row into a table when there are only a few rows present and when
there are thousands:

Rows Present    Start Time (epoch s)    Finish Time (epoch s)
------------------------------------------------------------
   100          1068790804.12           1068790804.12
  1000          1068790807.87           1068790807.87
  5000          1068790839.26           1068790839.27
 10000          1068790909.24           1068790909.26
 20000          1068791172.82           1068791172.85
 30000          1068791664.06           1068791664.09
 40000          1068792369.94           1068792370.00
 50000          1068793317.53           1068793317.60
 60000          1068794369.38           1068794369.47

As you can see, it takes an awfully long time just to have those values
inserted. To make the picture a bit clearer: this table holds a lot of
information, about 25 columns. There are also a few indexes that I created
so that selecting values from it is faster, which by the way works fine;
selecting anything takes under 5 seconds.

Any help would be greatly appreciated, even just pointing me in the right
direction to ask this question. By the way, I designed the database this
way because my application uses PGSQL heavily during execution, so there
was a huge need for fast SELECTs. Our experiments are getting larger and
larger every day, so fast inserts would be good as well.

Just to note, the times above are for the INSERTs only; nothing else is
included in them. The machine was otherwise idle, this was the only process
running the whole time, and the machine was an Intel(R) Pentium(R) 4 CPU
2.40GHz.

Regards,
Slavisa



Re: INSERT extremely slow with large data sets

From:
Alvaro Herrera
Date:
On Fri, Nov 14, 2003 at 06:36:41PM +1100, Slavisa Garic wrote:

> Rows Present        Start Time        Finish Time
> ------------------------------------------------------------
> 100            1068790804.12           1068790804.12
> 1000            1068790807.87          1068790807.87
> 5000            1068790839.26        1068790839.27
> 10000            1068790909.24          1068790909.26
> 20000            1068791172.82        1068791172.85
> 30000            1068791664.06        1068791664.09 
> 40000            1068792369.94        1068792370.0
> 50000            1068793317.53        1068793317.6
> 60000            1068794369.38        1068794369.47

> [too slow]

Ok, so inserting 60000 rows seems to take 0.09 seconds, and inserting
5000 takes only 0.01.  And your problem is exactly what?

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Coge la flor que hoy nace alegre, ufana. ¿Quién sabe si nacera otra mañana?"


Re: INSERT extremely slow with large data sets

From:
"Dann Corbit"
Date:
> -----Original Message-----
> From: Slavisa Garic [mailto:Slavisa.Garic@infotech.monash.edu.au]
> Sent: Thursday, November 13, 2003 11:37 PM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] INSERT extremely slow with large data sets
>
> [snip]

You should post the schema for the table in question when you ask a
question like this.

The behavior is not surprising in the least bit.  Every database will
perform in this way, since you have mentioned that you have indexes on
the table.

The depth of the tree will be proportional to the log of the row count.
As the tree gets deeper, inserts will be more and more expensive.

If you have a giant pile of stuff to insert, consider the COPY command
or API if it is time critical.
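
For instance, a bulk load might look like the following (a minimal sketch;
the file name and layout are made up for illustration, not taken from this
thread):

    -- Server-side COPY (superuser-only in this era of PostgreSQL); the file
    -- is tab-delimited text whose columns match the NimrodEnfJob definition.
    COPY NimrodEnfJob FROM '/tmp/nimrod_jobs.dat';

    -- Or, from psql as an ordinary user, read the same file on the client:
    -- \copy NimrodEnfJob from '/tmp/nimrod_jobs.dat'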


Re: INSERT extremely slow with large data sets

From:
Slavisa Garic
Date:

On Fri, 14 Nov 2003, Alvaro Herrera wrote:

> On Fri, Nov 14, 2003 at 06:36:41PM +1100, Slavisa Garic wrote:
>
> > [snip]
>
> Ok, so inserting 60000 rows seems to take 0.09 seconds, and inserting
> 5000 takes only 0.01.  And your problem is exactly what?

You didn't understand the question. Inserting ONE ROW when there are already
5000 ROWS present takes 0.01 seconds. Inserting ONE ROW when there are
already 60000 ROWS present takes 0.09 seconds. In other words, it takes
about 9 times longer to insert ONE ROW when there is a larger set of data
already in the database. As my experiments grow and more data is inserted,
this is getting to take too long: inserting 70000 rows takes just over an
hour, while inserting 5000 takes about a minute and a half.

I don't know if this is the behaviour to be expected, which is why I posted
the question, and that is my problem. I also wanted to know what can be
done to improve this, if it can be.

Regards,
Slavisa
> --
> Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
> "Coge la flor que hoy nace alegre, ufana. �Qui�n sabe si nacera otra ma�ana?"
>



Re: INSERT extremely slow with large data sets

From:
Slavisa Garic
Date:
Hi Dann

Here is the schema. Also, could you be more specific about the COPY command?
And does talking directly to the API speed things up? (I am new to databases
but I am learning quickly.)
    -- NimrodEnfJob --

    create table NimrodEnfJob(
        exp_id      INTEGER not null references NimrodEnfExperiment,
        task_id     INTEGER not null references NimrodTask,
        pgroup_id   INTEGER not null references NimrodParameterGroup,
        agent_id    INTEGER references NimrodAgent on delete set null,
        jobname     varchar(255) not null,
        admin       char(1) not null default 'F'
                    check (admin in ('F', 'T')),
        taskname    varchar(255) not null,
        param_text  TEXT not null,
        open        char(1) not null default 'F'
                    check (open in ('F', 'T')),
        control     varchar(8) not null default 'start'
                    check (control in ('nostart', 'start', 'stop')),
        status      varchar(16) not null default 'ready'
                    check (status in ('ready', 'executing', 'failed',
                        'done')),
        cpulength   real not null,
        sleeptime   real not null,
        filesize    real not null,
        cputime     real,
        waittime    real,
        filetime    real,
        filebytes   integer,
        priority    integer not null default 100,
        create_time timestamp not null default CURRENT_TIMESTAMP,
        start_time  timestamp,
        finish_time timestamp,
        budget      real not null default 0.0,
        servername  varchar(255),
        error_info  varchar(255) not null default '',
        more_info   TEXT not null default '',
        primary key (exp_id, jobname),
        foreign key (exp_id, taskname) references NimrodEnfTask
    );

Also, these are the indexes on this table. I created them on the columns
that are most commonly accessed:

    create unique index nej_idx
    ON NimrodEnfJob (exp_id, pgroup_id);

    create unique index nej_idx1
    ON NimrodEnfJob (pgroup_id);

    create index nej_idx2
    ON NimrodEnfJob (status);

    create unique index nej_idx3
    ON NimrodEnfJob (status, pgroup_id);

    create index nej_idx4
    ON NimrodEnfJob (status, agent_id);

    create index nej_idx5
    ON NimrodEnfJob (agent_id);

I did notice that removing those indexes doesn't improve things by much.
Similar behaviour is observed; it just takes a bit less time to insert
(about 0.01 seconds less than usual at 60000 records).

Regards,
Slavisa

On Fri, 14 Nov 2003, Dann Corbit wrote:

> > [snip]
> 
> You should post the schema for the table in question when you ask a
> question like this.
> 
> The behavior is not surprising in the least bit.  Every database will
> perform in this way, since you have mentioned that you have indexes on
> the table.
> 
> The depth of the tree will be proportional to the log of the row count.
> As the tree gets deeper, inserts will be more and more expensive.
> 
> If you have a giant pile of stuff to insert, consider the COPY command
> or API if it is time critical.
> 



Re: INSERT extremely slow with large data sets

From:
"Dann Corbit"
Date:
> -----Original Message-----
> From: Slavisa Garic [mailto:Slavisa.Garic@infotech.monash.edu.au]
> Sent: Friday, November 14, 2003 5:12 PM
> To: Dann Corbit
> Cc: Slavisa Garic; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] INSERT extremely slow with large data sets
>
>
> Hi Dann
>
> Here is the schema and also could you just be more specific
> on COPY command.

http://www.postgresql.org/docs/7.3/static/sql-copy.html

And

http://techdocs.postgresql.org/techdocs/usingcopy.php

May be helpful.

> Also does talking directly to the API speed
> things up? (I am new to databases but I am learning quickly)

Not particularly.  This is the copy command API:
http://www.postgresql.org/docs/7.3/static/libpq-copy.html

What the API allows you to do (for instance) is to never let the data touch
the ground: instead of writing to a text file or even a binary-format copy
input file, you use the API to take the incoming data and insert it
directly.

Like everything else, there is a dark side.  Read the documents and they
will explain it.  But if you need to move a giant pile of data into the
database as fast as possible, it is the copy command that is the most
efficient.
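
As an illustration only (the column list and data values below are made up
and assume matching rows already exist in the referenced tables), streaming
rows over the connection with COPY ... FROM stdin avoids any intermediate
file, which is exactly what the copy API drives programmatically:

    -- Hypothetical sketch: the column list covers the NOT NULL columns that
    -- have no default; every other column falls back to its default value.
    COPY NimrodEnfJob (exp_id, task_id, pgroup_id, jobname, taskname,
                       param_text, cpulength, sleeptime, filesize)
        FROM stdin WITH DELIMITER '|';
    1|7|42|job-00001|mytask|x=1|10.0|0.0|1024.0
    \.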
>     -- NimrodEnfJob --
>
>     create table NimrodEnfJob(
>         exp_id  INTEGER not null references NimrodEnfExperiment,
>         task_id INTEGER not null references NimrodTask,
>         pgroup_id   INTEGER not null references
> NimrodParameterGroup,
>         agent_id    INTEGER references NimrodAgent on
> delete set null,
>         jobname varchar(255) not null,
>         admin   char(1) not null default 'F'
>                 check (admin in ('F', 'T')),
>         taskname    varchar(255) not null,
>         param_text  TEXT not null,
>         open    char(1) not null default 'F'
>                 check (open in ('F', 'T')),
>         control varchar(8) not null default 'start'
>                 check (control in ('nostart', 'start', 'stop')),
>         status  varchar(16) not null default 'ready'
>                 check (status in ('ready', 'executing', 'failed',
>                     'done')),
>         cpulength   real not null,
>         sleeptime   real not null,
>         filesize    real not null,
>         cputime   real,
>         waittime    real,
>         filetime    real,
>         filebytes   integer,
>         priority    integer not null default 100,
>         create_time timestamp not null default CURRENT_TIMESTAMP,
>         start_time timestamp,
>         finish_time timestamp,
>         budget      real not null default 0.0,
>         servername  varchar(255),
>         error_info  varchar(255) not null default '',
>         more_info   TEXT not null default '',
>         primary key (exp_id, jobname),
>         foreign key (exp_id, taskname) references NimrodEnfTask
>     );
>
> Also these are the indexes on this table. I created them on
> the columns that are most commonly accessed:
>     create unique index nej_idx
>     ON NimrodEnfJob (exp_id, pgroup_id);
>
>     create unique index nej_idx1
>     ON NimrodEnfJob (pgroup_id);
>
>     create index nej_idx2
>     ON NimrodEnfJob (status);
>
>     create unique index nej_idx3
>     ON NimrodEnfJob (status, pgroup_id);
>
>     create index nej_idx4
>     ON NimrodEnfJob (status, agent_id);
>
>     create index nej_idx5
>     ON NimrodEnfJob (agent_id);
>
> I did notice that removing those indexes doesn't improve things by
> much. Similar behaviour is observed but it just takes a bit
> less time to insert (0.01 less then usually at 60000 records)

I am quite surprised that removing the indexes does not have a large impact
on insert speed, especially since you have 6 of them.  Most of the other
costs that I can think of are fixed for inserts into a "bare table".
Perhaps someone with more intimate knowledge of the inner workings may know
why inserts into a table without any index would trail off in speed as the
table grows.
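
If the load does turn out to be time-critical, one rough recipe (a sketch
only, reusing the index names from the schema you posted) is to drop the
secondary indexes, load everything inside a single transaction or with
COPY, and rebuild the indexes afterwards:

    -- Drop the secondary indexes before the bulk load.
    DROP INDEX nej_idx;  DROP INDEX nej_idx1; DROP INDEX nej_idx2;
    DROP INDEX nej_idx3; DROP INDEX nej_idx4; DROP INDEX nej_idx5;

    BEGIN;
    -- ... run the INSERTs (or a COPY) here ...
    COMMIT;

    -- Rebuild the indexes in one pass each, after the data is in place.
    CREATE UNIQUE INDEX nej_idx  ON NimrodEnfJob (exp_id, pgroup_id);
    CREATE UNIQUE INDEX nej_idx1 ON NimrodEnfJob (pgroup_id);
    CREATE INDEX nej_idx2 ON NimrodEnfJob (status);
    CREATE UNIQUE INDEX nej_idx3 ON NimrodEnfJob (status, pgroup_id);
    CREATE INDEX nej_idx4 ON NimrodEnfJob (status, agent_id);
    CREATE INDEX nej_idx5 ON NimrodEnfJob (agent_id);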


[snip]


Re: INSERT extremely slow with large data sets

From:
Tom Lane
Date:
Slavisa Garic <Slavisa.Garic@infotech.monash.edu.au> writes:
> You didn't understand the question. Inserting ONE ROW when there are already
> 5000 ROWS present takes 0.01 seconds. Inserting ONE ROW when there are
> already 60000 ROWS present takes 0.09 seconds.

The numbers you presented didn't really offer any strong grounds for
believing that there's an O(N) growth rate --- as far as I can see your
results are only barely out of the measurement-noise category.  Can you
run some tests where the issue is not in the least significant digit
of the available numbers?
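
For example, a coarser measurement could time a batch of identical inserts
rather than a single row (a sketch only; the scratch table below is made up
and deliberately has no foreign keys, so only heap and index costs show up):

    -- Scratch table with a single index (the primary key).
    CREATE TABLE insert_test (id integer primary key, payload text);

    SELECT timeofday();                          -- start time
    INSERT INTO insert_test VALUES (1, 'xxxx');
    -- ... repeat for a few thousand consecutive ids ...
    SELECT timeofday();                          -- finish time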

But assuming for the moment that you've got hold of a real problem...

The actual insertion of a row should be essentially a constant-time
operation, since we just stick it into the last page of the table
(or any page with sufficient free space).  Insertion of index entries
for the row would have cost that depends on the number of existing
table entries, but for btree indexes I'd expect the cost to vary as
O(log2(N)) not O(N).  I do not think you've presented enough evidence
to prove that you're seeing linear rather than log-N cost growth.

Most of the serious insertion-cost problems we've seen lately have
to do with the costs of checking foreign key references ... but those
normally vary with the size of the referenced table, not the table into
which you're inserting.  Besides which you mentioned nothing about
foreign keys ... or any other schema details as far as I saw ...
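
If foreign keys do turn out to be involved, a quick sanity check (a sketch
with hypothetical table and column names, since the referenced tables were
not posted) is to make sure the per-row lookup the FK trigger performs can
use an index on each referenced table:

    -- Simulate the lookup done for each inserted row; the plan should show
    -- an index scan on the referenced table's key, not a sequential scan.
    EXPLAIN SELECT 1 FROM referenced_table WHERE referenced_pk = 42;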
        regards, tom lane