Discussion: Insert performance

From:
"hatman"
Date:

Dear all,

After many tests and a lot of doc reading, I am finally turning to you
for help...

Here is my problem: with heavy inserting into a simple database (one
table, no indexes) I can't get better performance than 8000 inserts/sec.
I'm testing it using a simple C program which uses libpq and which uses:
- a prepared INSERT statement (to avoid too much query parsing on the
server)
- transactions of 100000 inserts each

My server has the following config:
- 3 GB RAM
- Pentium D, 64-bit, 3 GHz
- database data on hardware RAID 0 disks
- pg_xlog (WAL) on another, separate hard drive

The server only uses 30% of the CPU, 10% of the disk bandwidth and not
much RAM... so I'm wondering where the bottleneck could be, and why I
can't get better performance.
I really need to use INSERTs and I can't change the code to use COPY...

Any advice is welcome. Apologies in advance for my limited understanding
of databases!

Thanks in advance.

Regards,


Joël.W


From:
Richard Huxton
Date:

hatman wrote:
> Dear all,
>
> After many tests and a lot of doc reading, I am finally turning to you
> for help...
>
> Here is my problem: with heavy inserting into a simple database (one
> table, no indexes) I can't get better performance than 8000 inserts/sec.
> I'm testing it using a simple C program which uses libpq and which uses:
> - a prepared INSERT statement (to avoid too much query parsing on the
> server)
> - transactions of 100000 inserts each

Are each of the INSERTs in their own transaction?

If so, you'll be limited by the speed of the disk the WAL is running on.

That means you have two main options:
1. Have multiple connections inserting simultaneously.
2. Batch your inserts together, from 10 to 10,000 per transaction.

Are either of those possible?
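A minimal sketch of option 2 (table and column names are made up):

```sql
-- One transaction per batch: the WAL only has to be flushed once
-- per COMMIT instead of once per row.
BEGIN;
INSERT INTO t (a, b) VALUES (1, 'x');
INSERT INTO t (a, b) VALUES (2, 'y');
-- ... up to a few thousand rows ...
COMMIT;
```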

--
   Richard Huxton
   Archonet Ltd

From:
Richard Huxton
Date:

joël Winteregg wrote:
> Hi Richard,
>
>>> Here is my problem: with heavy inserting into a simple database (one
>>> table, no indexes) I can't get better performance than 8000 inserts/sec.
>>> I'm testing it using a simple C program which uses libpq and which uses:
>>> - a prepared INSERT statement (to avoid too much query parsing on the
>>> server)
>>> - transactions of 100000 inserts each
>> Are each of the INSERTs in their own transaction?
>>
>
> No, as I said above, each transaction is made of 100000 inserts...

Hmm - I read that as just meaning "inserted 100000 rows". You might find
that smaller batches provide peak performance.

>> If so, you'll be limited by the speed of the disk the WAL is running on.
>>
>> That means you have two main options:
>> 1. Have multiple connections inserting simultaneously.
>
> Yes, you're right. That's what I have been testing and what provides the
> best performance! I saw that the postgresql frontend was using a lot of
> CPU, and not both cores (I'm using a Pentium D, dual core). In contrast,
> the postmaster process doesn't use many resources. Using several clients,
> both CPUs are used and I saw an increase in performance (about 18000
> inserts/sec).
>
> So I think my bottleneck is the CPU speed rather than the disk speed.
> What do you think?

Well, I think it's fair to say it's not disk. Let's see - the original
figure was 8000 inserts/sec, which is 0.125ms per insert. That sounds
plausible to me for a round-trip to process a simple command - are you
running the client app on the same machine, or is it over the network?

Two other things to bear in mind:
1. If you're running 8.2 you can have multiple sets of values in an INSERT
http://www.postgresql.org/docs/8.2/static/sql-insert.html

2. You can do a COPY from libpq - is it really not possible?
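The multi-row form in point 1 looks like this (hypothetical table,
made-up values):

```sql
-- One statement, one parse, several rows; supported from 8.2 onwards.
INSERT INTO t (a, b) VALUES
    (1, 'x'),
    (2, 'y'),
    (3, 'z');
```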

--
   Richard Huxton
   Archonet Ltd

From:
Richard Huxton
Date:

joël Winteregg wrote:
>
>>> No, as I said above, each transaction is made of 100000 inserts...
>> Hmm - I read that as just meaning "inserted 100000 rows". You might find
>> that smaller batches provide peak performance.
>
> Ahh, ok ;-) sorry for my bad English... (yeah, I have been testing
> several transaction sizes: 10000, 20000 and 100000)

Not your bad English, my poor reading :-)

>>>> If so, you'll be limited by the speed of the disk the WAL is running on.
>>>>
>>>> That means you have two main options:
>>>> 1. Have multiple connections inserting simultaneously.
>>> Yes, you're right. That's what I have been testing and what provides the
>>> best performance! I saw that the postgresql frontend was using a lot of
>>> CPU, and not both cores (I'm using a Pentium D, dual core). In contrast,
>>> the postmaster process doesn't use many resources. Using several clients,
>>> both CPUs are used and I saw an increase in performance (about 18000
>>> inserts/sec).
>>>
>>> So I think my bottleneck is the CPU speed rather than the disk speed.
>>> What do you think?
>> Well, I think it's fair to say it's not disk. Let's see - the original
>> figure was 8000 inserts/sec, which is 0.125ms per insert. That sounds
>> plausible to me for a round-trip to process a simple command - are you
>> running the client app on the same machine, or is it over the network?
>
> I did both tests. On the local machine (using UNIX sockets) I can reach
> 18000 inserts/sec with 10 clients and prepared statements. The same test
> using clients on a remote machine gives me 13000 inserts/sec.

OK, so we know what the overhead for network connections is.

> Now, with multiple clients (multi-threaded inserts) both my CPUs are
> quite well used (both around 90%), so maybe disk speed is now my
> bottleneck. What do you think? Or maybe I need a better CPU?
>
>> Two other things to bear in mind:
>> 1. If you're running 8.2 you can have multiple sets of values in an INSERT
>> http://www.postgresql.org/docs/8.2/static/sql-insert.html
>
> Yeah, I'm running the 8.2.3 version! I didn't know about multi-row
> inserts! Thanks for the tip ;-)

Ah-ha! Give it a go, it's designed for this sort of situation. Not sure
it'll manage thousands of value clauses, but work up from 10 perhaps.
I've not tested it for performance, so I'd be interested in knowing how
it compares to your other results.

>> 2. You can do a COPY from libpq - is it really not possible?
>>
>
> Not really, but I have been testing it and inserts are flying (about
> 100000 inserts/sec)!!

What's the problem with the COPY? Could you COPY into one table then
insert from that to your target table?
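Something along these lines (the table names and file path are made up):

```sql
-- Load into an unconstrained holding table first...
COPY staging_t (a, b) FROM '/path/to/data.csv' WITH CSV;

-- ...then tidy the data and move it to the real target.
INSERT INTO t (a, b)
SELECT a, b
FROM staging_t
WHERE a IS NOT NULL;   -- whatever cleanup your data needs

TRUNCATE staging_t;
```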

--
   Richard Huxton
   Archonet Ltd

From:
Andreas Kostyrka
Date:

* Richard Huxton <> [070306 12:22]:
> >>2. You can do a COPY from libpq - is it really not possible?
> >>
> >Not really but i have been testing it and inserts are flying (about
> >100000 inserts/sec) !!
>
> What's the problem with the COPY? Could you COPY into one table then insert from that to your target table?
Well, there are some issues. First, your client needs to support it.
E.g. psycopg2 supports only some specific CSV formatting in its
methods. (Plus I sometimes had random psycopg2 crashes, but guarding
against these is cheap compared to the speedup of COPY versus INSERT.)
Plus you need to be sure that your data will apply cleanly (which in
my app was not the case), or you need to code a fallback that
pinpoints the row that doesn't work.

And the worst thing is that it ignores RULES on the tables, which
sucks if you use them ;) (e.g. table partitioning).

Andreas

From:
Richard Huxton
Date:

Andreas Kostyrka wrote:
> * Richard Huxton <> [070306 12:22]:
>>>> 2. You can do a COPY from libpq - is it really not possible?
>>>>
>>> Not really, but I have been testing it and inserts are flying (about
>>> 100000 inserts/sec)!!
>> What's the problem with the COPY? Could you COPY into one table then
>> insert from that to your target table?
> Well, there are some issues. First, your client needs to support it.
> E.g. psycopg2 supports only some specific CSV formatting in its
> methods. (Plus I sometimes had random psycopg2 crashes, but guarding
> against these is cheap compared to the speedup of COPY versus INSERT.)
> Plus you need to be sure that your data will apply cleanly (which in
> my app was not the case), or you need to code a fallback that
> pinpoints the row that doesn't work.
>
> And the worst thing is that it ignores RULES on the tables, which
> sucks if you use them ;) (e.g. table partitioning).

Ah, but two things deal with these issues:
1. Joel is using libpq
2. COPY into a holding table, tidy data and INSERT ... SELECT

--
   Richard Huxton
   Archonet Ltd

From:
Andreas Kostyrka
Date:

* Richard Huxton <> [070306 13:47]:
> Andreas Kostyrka wrote:
> > [...]
> > And the worst thing is that it ignores RULES on the tables, which
> > sucks if you use them ;) (e.g. table partitioning).
>
> Ah, but two things deal with these issues:
> 1. Joel is using libpq
> 2. COPY into a holding table, tidy data and INSERT ... SELECT

Clearly COPY is the way to bulk-load data, BUT you asked, so I
wanted to point out some problems and brittle points with COPY.

(And the COPY into a holding table doesn't completely solve the
problem of dirty, inconsistent data.)

Andreas

From:
Carlos Moreno
Date:

>>> 1. If you're running 8.2 you can have multiple sets of values in an
>>> INSERT
>>> http://www.postgresql.org/docs/8.2/static/sql-insert.html
>>
>> Yeah, I'm running the 8.2.3 version! I didn't know about multi-row
>> inserts! Thanks for the tip ;-)

No kidding --- thanks for the tip from me as well!!!

I didn't know this was possible (though I read in the docs that it is
ANSI SQL standard), and I'm in a similar situation.

Two related questions:

1) What about atomicity? Is it strictly equivalent to having multiple
insert statements inside a transaction? (I assume it should be.)

2) What about the issue with excessive locking for foreign keys when
inside a transaction? Has that issue disappeared in 8.2? And if not,
does it affect multi-row inserts in the same way?

In case you have no clue what I'm referring to:

Say that we have a table A with a foreign key constraint to table
B --- last time I checked, there was an issue whereby, when inserting
into or updating table A (inside a transaction), postgres set an
exclusive-access lock on the referenced row of table B. This is
overkill; the correct thing to do would be to take a read-only lock
(so that no-one else can *modify or remove* the referenced row until
the transaction has finished).

This caused unnecessary deadlock situations: even though no-one was
modifying table B (which would be enough to guarantee that the
concurrent transactions were ok), a second transaction would fail to
take the exclusive-access lock, since someone had already locked the
row.

My solution was to sort the insert statements by the referenced value
on table B.

(I hope the above explanation clarifies what I'm trying to say.)

I wonder if I should still do the same if I go with a multi-row
insert instead of multiple insert statements inside a transaction.
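To illustrate the workaround (table and column names are made up): if two
transactions insert rows referencing parents 1 and 2 in opposite orders,
each can end up holding one parent-row lock while waiting for the other.
Sorting by the referenced key makes every transaction acquire the locks
in the same order, so no cycle can form:

```sql
-- Every transaction inserts its children ordered by b_id,
-- so the locks on table B's rows are always taken 1 then 2:
BEGIN;
INSERT INTO a (b_id, val) VALUES (1, 'first');   -- locks B's row 1
INSERT INTO a (b_id, val) VALUES (2, 'second');  -- locks B's row 2
COMMIT;
```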

Thanks,

Carlos
--


From:
Csaba Nagy
Date:

I can only answer your no. 2:
> 2) What about the issue with excessive locking for foreign keys when
> inside a transaction? Has that issue disappeared in 8.2? And if not,
> does it affect multi-row inserts in the same way?

The exclusive lock is already gone starting with 8.0 IIRC; a
non-exclusive lock on the parent row is used instead. The thing is
that this is still too strong ;-)

The proper lock would be one which only prevents modification of the
parent *key*; other updates to the same row would be safe.

In any case, the current behavior is much better than what it was before.
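For context, the non-exclusive lock in question is the FOR SHARE row
lock; conceptually, the foreign-key check on an insert into the child
table does something roughly like this (hypothetical parent table b,
made-up key value):

```sql
-- Lock the parent row against deletion or key changes, while
-- still allowing other transactions to take the same shared lock.
SELECT 1 FROM b WHERE id = 42 FOR SHARE;
```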

Cheers,
Csaba.



From:
Carlos Moreno
Date:

Csaba Nagy wrote:

>I can only answer your no. 2:
>
>>2) What about the issue with excessive locking for foreign keys when
>>inside a transaction?  Has that issue disappeared in 8.2?  And if not,
>>does it affect multi-row inserts in the same way?
>
>The exclusive lock is already gone starting with 8.0 IIRC; a
>non-exclusive lock on the parent row is used instead. The thing is
>that this is still too strong ;-)
>
>The proper lock would be one which only prevents modification of the
>parent *key*; other updates to the same row would be safe.
>
>In any case, the current behavior is much better than what it was before.

*Much* better, I would say --- though you're right that it is still
not the right thing to do.

In particular, with the previous approach there was a serious
performance hit when concurrent transactions referenced the same keys:
even after taking measures to avoid deadlocks, some transactions would
have to *wait* (for no good reason) until the other transaction
completed and the exclusive-access lock was released. For high-traffic
databases this can be quite a severe performance hit. I'm glad it has
been fixed, even if only partially.

Thanks,

Carlos
--


From:
"hatman"
Date:

Hi Andreas,

Thanks for the info about COPY !!

On Mar 6, 1:23 pm, andr...@kostyrka.org (Andreas Kostyrka) wrote:
> * Richard Huxton <d...@archonet.com> [070306 12:22]:
> > >> 2. You can do a COPY from libpq - is it really not possible?
>
> > > Not really, but I have been testing it and inserts are flying (about
> > > 100000 inserts/sec)!!
>
> > What's the problem with the COPY? Could you COPY into one table then
> > insert from that to your target table?
>
> Well, there are some issues. First, your client needs to support it.
> E.g. psycopg2 supports only some specific CSV formatting in its
> methods. (Plus I sometimes had random psycopg2 crashes, but guarding
> against these is cheap compared to the speedup of COPY versus INSERT.)
> Plus you need to be sure that your data will apply cleanly (which in
> my app was not the case), or you need to code a fallback that
> pinpoints the row that doesn't work.
>
> And the worst thing is that it ignores RULES on the tables, which
> sucks if you use them ;) (e.g. table partitioning).

Ok, but what about constraints (foreign keys and SERIAL ids) when
using a COPY statement? Do we need to handle auto-generated ids
(SERIAL) manually?

Thanks for your feedback.

Regards,

Joël


From:
"hatman"
Date:

Hi Richard,

> Ah-ha! Give it a go, it's designed for this sort of situation. Not sure
> it'll manage thousands of value clauses, but work up from 10 perhaps.
> I've not tested it for performance, so I'd be interested in knowing how
> it compares to your other results.

Yeah, as soon as possible I will give it a try! Thanks for the
feedback ;-)

>
> >> 2. You can do a COPY from libpq - is it really not possible?
>
> > Not really, but I have been testing it and inserts are flying (about
> > 100000 inserts/sec)!!
>
> What's the problem with the COPY? Could you COPY into one table then
> insert from that to your target table?

The main problem comes from our "real time" needs. We are getting
data as a continuous flow from several applications, and we need to
store it in the DB without buffering it for too long...
I have been testing COPY with several statements (I mean using COPY
to add only a few rows to one table, then using it on another table
to add a few rows, etc...) and the performance is as bad as with
INSERT!
COPY seems to be designed to add many, many rows to the same table,
not a few rows to several tables... So that's my main problem.

Regards,

Joël


From:
joël Winteregg
Date:

Hi and thanks for your quick answer :-)

> >
> >>> Here is my problem: with heavy inserting into a simple database (one
> >>> table, no indexes) I can't get better performance than 8000 inserts/sec.
> >>> I'm testing it using a simple C program which uses libpq and which uses:
> >>> - a prepared INSERT statement (to avoid too much query parsing on the
> >>> server)
> >>> - transactions of 100000 inserts each
> >> Are each of the INSERTs in their own transaction?
> >>
> >
> > No, as I said above, each transaction is made of 100000 inserts...
>
> Hmm - I read that as just meaning "inserted 100000 rows". You might find
> that smaller batches provide peak performance.
>
>

Ahh, ok ;-) sorry for my bad English... (yeah, I have been testing
several transaction sizes: 10000, 20000 and 100000)


> >> If so, you'll be limited by the speed of the disk the WAL is running on.
> >>
> >> That means you have two main options:
> >> 1. Have multiple connections inserting simultaneously.
> >
> > Yes, you're right. That's what I have been testing and what provides the
> > best performance! I saw that the postgresql frontend was using a lot of
> > CPU, and not both cores (I'm using a Pentium D, dual core). In contrast,
> > the postmaster process doesn't use many resources. Using several clients,
> > both CPUs are used and I saw an increase in performance (about 18000
> > inserts/sec).
> >
> > So I think my bottleneck is the CPU speed rather than the disk speed.
> > What do you think?
>
> Well, I think it's fair to say it's not disk. Let's see - the original
> figure was 8000 inserts/sec, which is 0.125ms per insert. That sounds
> plausible to me for a round-trip to process a simple command - are you
> running the client app on the same machine, or is it over the network?

I did both tests. On the local machine (using UNIX sockets) I can reach
18000 inserts/sec with 10 clients and prepared statements. The same test
using clients on a remote machine gives me 13000 inserts/sec.

Now, with multiple clients (multi-threaded inserts) both my CPUs are
quite well used (both around 90%), so maybe disk speed is now my
bottleneck. What do you think? Or maybe I need a better CPU?

>
> Two other things to bear in mind:
> 1. If you're running 8.2 you can have multiple sets of values in an INSERT
> http://www.postgresql.org/docs/8.2/static/sql-insert.html
>

Yeah, I'm running the 8.2.3 version! I didn't know about multi-row
inserts! Thanks for the tip ;-)

> 2. You can do a COPY from libpq - is it really not possible?
>

Not really, but I have been testing it and inserts are flying (about
100000 inserts/sec)!!




From:
joël Winteregg
Date:

Hi Richard,

> >
> > Here is my problem: with heavy inserting into a simple database (one
> > table, no indexes) I can't get better performance than 8000 inserts/sec.
> > I'm testing it using a simple C program which uses libpq and which uses:
> > - a prepared INSERT statement (to avoid too much query parsing on the
> > server)
> > - transactions of 100000 inserts each
>
> Are each of the INSERTs in their own transaction?
>

No, as I said above, each transaction is made of 100000 inserts...

> If so, you'll be limited by the speed of the disk the WAL is running on.
>
> That means you have two main options:
> 1. Have multiple connections inserting simultaneously.

Yes, you're right. That's what I have been testing and what provides the
best performance! I saw that the postgresql frontend was using a lot of
CPU, and not both cores (I'm using a Pentium D, dual core). In contrast,
the postmaster process doesn't use many resources. Using several clients,
both CPUs are used and I saw an increase in performance (about 18000
inserts/sec).

So I think my bottleneck is the CPU speed rather than the disk speed.
What do you think?

I use 2 disks (RAID 0) for the data and a single disk for pg_xlog.

> 2. Batch your inserts together, from 10 to 10,000 per transaction.
>

Yes, that's what I'm doing.


Thanks a lot for the advice!


regards,


Joël