Обсуждение: How to determine cause of performance problem?

Поиск

Список

Период

Сортировка

How to determine cause of performance problem?

От

Joost Kraaijeveld

Дата:

23 сентября 2005 г., 06:58:45

Hi,

I must convert an old table into a new table. The conversion goes at ~
100 records per second. Given the fact that I must convert 40 million
records, it takes too much time: more hours than the 48 hour weekend I
have for the conversion;-).

The tables are rather simple: both tables only have a primary key
constraint (of type text) and no other indexes. I only copy 3 columns. I
use Java for the conversion. For the exact code see below.

During the conversion my processor load is almost non existant. The
harddisk throughput is ~ 6 megabyte/second max (measured with iostat).

My platform is Debian Sarge AMD64. My hardware is a Tyan Thunder K8W
2885 motherboard, 2 Opteron 248 processors, 2 GB RAM, a SATA bootdisk
with / and swap, and a 3Ware 9500S-8 RAID-5 controller with 5 attached
SATA disks with /home and /var. /var contains *all* PostgreSQL log and
database files (default Debian installation).

Output of hdparm -Tt /dev/sdb (sdb is the RAID opartition)

/dev/sdb:
 Timing cached reads:   1696 MB in  2.00 seconds = 846.86 MB/sec
 Timing buffered disk reads:  246 MB in  3.01 seconds =  81.79 MB/sec


I want to determine the cause of my performance problem (if it is one).

1. Is this a performance I can expect?
2. If not, how can I determine the cause?
3. Can I anyhow improve the performance without replacing my hardware,
e.g. by tweaking the software?
4. Is there a Linux (Debian) tool that I can use to benchmark write
performance?



--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl


The Java code I use for the conversion  :

//////////////// ....
ResultSet resultSet = selectStatement.executeQuery(
"select ordernummer, orderdatum, klantnummer from odbc.orders order by
ordernummer");

connection.setAutoCommit(false);

PreparedStatement ordersInsertStatement =
connection.prepareStatement("insert into prototype.orders
(objectid,ordernumber,orderdate,customernumber) values (?,?,?,?)");

while( resultSet.next() )
{

if( (++record % 100) == 0){
    System.err.println( "handling record: " + record);
}

// the next line can do > 1.000.000 objectId/sec
String orderObjectId = ObjectIdGenerator.newObjectId();
ordersInsertStatement.setString(1,orderObjectId);
ordersInsertStatement.setInt(2,resultSet.getInt("ordernummer"));
ordersInsertStatement.setDate(3,resultSet.getDate("orderdatum"));
ordersInsertStatement.setInt(4,resultSet.getInt("klantnummer"));

ordersInsertStatement.execute();

}

connection.commit();

Re: How to determine cause of performance problem?

От

Ulrich Wisser

Дата:

23 сентября 2005 г., 09:31:35

Hi Joost,

why do you convert programmatically? I would do something like

create sequence s_objectid;

insert into
prototype.orders(objectid,ordernumber,orderdate,customernumber)
select next_val('s_objectid'),ordernummer, orderdatum, klantnummer from
odbc.orders


Sounds a lot faster to me.


/Ulrich

Re: How to determine cause of performance problem?

От

Michael Stone

Дата:

23 сентября 2005 г., 09:55:27

On Fri, Sep 23, 2005 at 08:49:27AM +0200, Joost Kraaijeveld wrote:
>3. Can I anyhow improve the performance without replacing my hardware,
>e.g. by tweaking the software?

It's not clear what your object id generator does. If it's just a
sequence, it's not clear that you need this program at all--just use a
SELECT INTO and make the object id a SERIAL.

If you do need to control the object id or do some other processing
before putting the data into the new table, rewrite to use a COPY
instead of an INSERT.

Mike Stone

Re: How to determine cause of performance problem?

От

Joost Kraaijeveld

Дата:

23 сентября 2005 г., 10:21:23

On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
> It's not clear what your object id generator does. If it's just a
> sequence, it's not clear that you need this program at all--just use a
> SELECT INTO and make the object id a SERIAL.
It generates a GUID (and no, I do not want to turn this in a discussion
about GUIDs). As in the Java code comment: it is not the generation of
the GUID that is the problem (that is, I can generate millions of them
per second.)

> If you do need to control the object id or do some other processing
> before putting the data into the new table, rewrite to use a COPY
> instead of an INSERT.
It is actually the shortest piece of code that gives me a poor
performance. The conversion problem is much, much larger and much much
more complicated.

I suspect that either my hardware is to slow (but then again, see the
specs), or my Debian is to slow, or my PostgreSQL settings are wrong.

But I have no clue where to begin with determining the bottleneck (it
even may be a normal performance for all I know: I have no experience
with converting such (large) database).

Any suggestions?


--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: How to determine cause of performance problem?

От

Michael Stone

Дата:

23 сентября 2005 г., 11:05:07

On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:
>On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
>> It's not clear what your object id generator does. If it's just a
>> sequence, it's not clear that you need this program at all--just use a
>> SELECT INTO and make the object id a SERIAL.
>It generates a GUID (and no, I do not want to turn this in a discussion
>about GUIDs). As in the Java code comment: it is not the generation of
>the GUID that is the problem (that is, I can generate millions of them
>per second.)

I didn't say it was, did I? If you use a SELECT INTO instead of
SELECTing each record and then reINSERTing it you avoid a round trip
latency for each row. There's a reason I said "if it's just a sequence".

>> If you do need to control the object id or do some other processing
>> before putting the data into the new table, rewrite to use a COPY
>> instead of an INSERT.
>It is actually the shortest piece of code that gives me a poor
>performance. The conversion problem is much, much larger and much much
>more complicated.

Ok, that's great, but you didn't respond to the suggestion of using COPY
INTO instead of INSERT.

>But I have no clue where to begin with determining the bottleneck (it
>even may be a normal performance for all I know: I have no experience
>with converting such (large) database).
>
>Any suggestions?

Respond to the first suggestion?

Mike Stone

Re: How to determine cause of performance problem?

От

Arjen van der Meijden

Дата:

23 сентября 2005 г., 11:19:49

On 23-9-2005 13:05, Michael Stone wrote:
> On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:
>
> Ok, that's great, but you didn't respond to the suggestion of using COPY
> INTO instead of INSERT.
>
>> But I have no clue where to begin with determining the bottleneck (it
>> even may be a normal performance for all I know: I have no experience
>> with converting such (large) database).
>>
>> Any suggestions?
>
>
> Respond to the first suggestion?

Another suggestion:
How many indexes and constraints are on the new table?
Drop all of them and recreate them once the table is filled. Of course
that only works if you know your data will be ok (which is normal for
imports of already conforming data like database dumps of existing tables).
This will give major performance improvements, if you have indexes and
such on the new table.

Best regards,

Arjen

Re: How to determine cause of performance problem?

От

Dave Cramer

Дата:

23 сентября 2005 г., 11:31:05

Joost,

I presume you are using a relatively new jdbc driver. Make sure you
have added prepareThreshold=1 to the url to that it will use a named
server side prepared statement

You could also use your mod 100 code block to implement batch
processing of the inserts.

see addBatch, in  jdbc specs

Dave

On 23-Sep-05, at 2:49 AM, Joost Kraaijeveld wrote:

> Hi,
>
> I must convert an old table into a new table. The conversion goes at ~
> 100 records per second. Given the fact that I must convert 40 million
> records, it takes too much time: more hours than the 48 hour weekend I
> have for the conversion;-).
>
> The tables are rather simple: both tables only have a primary key
> constraint (of type text) and no other indexes. I only copy 3
> columns. I
> use Java for the conversion. For the exact code see below.
>
> During the conversion my processor load is almost non existant. The
> harddisk throughput is ~ 6 megabyte/second max (measured with iostat).
>
> My platform is Debian Sarge AMD64. My hardware is a Tyan Thunder K8W
> 2885 motherboard, 2 Opteron 248 processors, 2 GB RAM, a SATA bootdisk
> with / and swap, and a 3Ware 9500S-8 RAID-5 controller with 5 attached
> SATA disks with /home and /var. /var contains *all* PostgreSQL log and
> database files (default Debian installation).
>
> Output of hdparm -Tt /dev/sdb (sdb is the RAID opartition)
>
> /dev/sdb:
>  Timing cached reads:   1696 MB in  2.00 seconds = 846.86 MB/sec
>  Timing buffered disk reads:  246 MB in  3.01 seconds =  81.79 MB/sec
>
>
> I want to determine the cause of my performance problem (if it is
> one).
>
> 1. Is this a performance I can expect?
> 2. If not, how can I determine the cause?
> 3. Can I anyhow improve the performance without replacing my hardware,
> e.g. by tweaking the software?
> 4. Is there a Linux (Debian) tool that I can use to benchmark write
> performance?
>
>
>
> --
> Groeten,
>
> Joost Kraaijeveld
> Askesis B.V.
> Molukkenstraat 14
> 6524NB Nijmegen
> tel: 024-3888063 / 06-51855277
> fax: 024-3608416
> e-mail: J.Kraaijeveld@Askesis.nl
> web: www.askesis.nl
>
>
> The Java code I use for the conversion  :
>
> //////////////// ....
> ResultSet resultSet = selectStatement.executeQuery(
> "select ordernummer, orderdatum, klantnummer from odbc.orders order by
> ordernummer");
>
> connection.setAutoCommit(false);
>
> PreparedStatement ordersInsertStatement =
> connection.prepareStatement("insert into prototype.orders
> (objectid,ordernumber,orderdate,customernumber) values (?,?,?,?)");
>
> while( resultSet.next() )
> {
>
> if( (++record % 100) == 0){
>     System.err.println( "handling record: " + record);
> }
>
> // the next line can do > 1.000.000 objectId/sec
> String orderObjectId = ObjectIdGenerator.newObjectId();
> ordersInsertStatement.setString(1,orderObjectId);
> ordersInsertStatement.setInt(2,resultSet.getInt("ordernummer"));
> ordersInsertStatement.setDate(3,resultSet.getDate("orderdatum"));
> ordersInsertStatement.setInt(4,resultSet.getInt("klantnummer"));
>
> ordersInsertStatement.execute();
>
> }
>
> connection.commit();
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>
>

Re: How to determine cause of performance problem?

От

Joost Kraaijeveld

Дата:

23 сентября 2005 г., 13:36:02

On Fri, 2005-09-23 at 13:19 +0200, Arjen van der Meijden wrote:
> Another suggestion:
> How many indexes and constraints are on the new table?
As mentioned in the first mail: in this tables only primary key
constraints, no other indexes or constraints.

> Drop all of them and recreate them once the table is filled. Of course
> that only works if you know your data will be ok (which is normal for
> imports of already conforming data like database dumps of existing tables).
> This will give major performance improvements, if you have indexes and
> such on the new table.
I will test this a for perfomance improvement, but still, I wonder if ~
100 inserts/second is a reasonable performance for my software/hardware
combination.

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: How to determine cause of performance problem?

От

Joost Kraaijeveld

Дата:

23 сентября 2005 г., 13:49:27

On Fri, 2005-09-23 at 07:05 -0400, Michael Stone wrote:
> On Fri, Sep 23, 2005 at 12:21:15PM +0200, Joost Kraaijeveld wrote:
> >On Fri, 2005-09-23 at 05:55 -0400, Michael Stone wrote:
> I didn't say it was, did I?
No, you did not. But only last week someon'es head was (luckely for him
only virtually) almost chopped off for suggesting the usage of GUIDs ;-)


> Ok, that's great, but you didn't respond to the suggestion of using COPY
> INTO instead of INSERT.
Part of the code I left out are some data conversions (e.g. from
path-to-file to blob, from text to date (not castable because of the
homebrew original format)). I don't believe that I can do these in a SQL
statement, can I (my knowledge of SQL as a langage is not that good)? .
However I will investigate if I can do the conversion in two steps and
check if it is faster.

But still, I wonder if ~100 inserts/second is a reasonable performance
for my software/hardware combination.


--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: How to determine cause of performance problem?

От

Arjen van der Meijden

Дата:

23 сентября 2005 г., 14:06:51

On 23-9-2005 15:35, Joost Kraaijeveld wrote:
> On Fri, 2005-09-23 at 13:19 +0200, Arjen van der Meijden wrote:
>>Drop all of them and recreate them once the table is filled. Of course
>>that only works if you know your data will be ok (which is normal for
>>imports of already conforming data like database dumps of existing tables).
>>This will give major performance improvements, if you have indexes and
>>such on the new table.
>
> I will test this a for perfomance improvement, but still, I wonder if ~
> 100 inserts/second is a reasonable performance for my software/hardware
> combination.

For the hardware: no, I don't think it is for such a simple table/small
recordsize.
I did a few batch-inserts with indexes on tables and was very
disappointed about the time it took. But with no indexes and constraints
left it flew and the table of 7 million records (of 3 ints and 2
bigints) was imported in 75 seconds, on a bit simpler hardware. That was
done using a simple pg_dump-built sql-file which was then fed to psql as
input. And of course that used the local unix socket, not the local
network interface (I don't know which jdbc takes).
But generating a single transaction (as you do) with inserts shouldn't
be that much slower.

So I don't think its your hardware, nor your postgresql, although a bit
extra maintenance_work_mem may help, if you haven't touched that.
Leaving the queries, the application and the driver. But I don't have
that much experience with jdbc and postgresql-performance. In php I
wouldn't select all the 40M records at once, the resultset would be in
the clients-memory and that may actually cause trouble. But I don't know
how that is implemented in JDBC, it may of course be using cursors and
it would be less of a problem than perhaps.
You could try writing the inserts to file and see how long that takes,
to eliminate the possibility of your application being slow on other
parts than the inserting of data. If that is fast enough, a last resort
may be to write a csv-file from java and use that with a copy-statement
in psql ;)

Best regards,

Arjen

Re: How to determine cause of performance problem?

От

Tom Lane

Дата:

23 сентября 2005 г., 14:33:09

Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> I will test this a for perfomance improvement, but still, I wonder if ~
> 100 inserts/second is a reasonable performance for my software/hardware
> combination.

Is the client code running on the same machine as the database server?
If not, what's the network delay and latency between them?

The major problem you're going to have here is at least one network
round trip per row inserted --- possibly more, if the jdbc driver is
doing "helpful" stuff behind your back like starting/committing
transactions.

            regards, tom lane

Re: How to determine cause of performance problem?

От

Joost Kraaijeveld

Дата:

23 сентября 2005 г., 14:47:10

On Fri, 2005-09-23 at 10:33 -0400, Tom Lane wrote:
> Is the client code running on the same machine as the database server?
> If not, what's the network delay and latency between them?
Yes, it is running on the same machine.


> The major problem you're going to have here is at least one network
> round trip per row inserted --- possibly more, if the jdbc driver is
> doing "helpful" stuff behind your back like starting/committing
> transactions.
OK, I will look into that.

But do you maybe know a pointer to info, or tools that can measure, what
my machine is doing during all the time it is doing nothing? Something
like the performance monitor in Windows but than for Linux?

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: How to determine cause of performance problem?

От

Tom Lane

Дата:

23 сентября 2005 г., 14:50:40

Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> But do you maybe know a pointer to info, or tools that can measure, what
> my machine is doing during all the time it is doing nothing? Something
> like the performance monitor in Windows but than for Linux?

top, vmstat, iostat, sar, strace, oprofile, ...

            regards, tom lane

Re: How to determine cause of performance problem?

От

Michael Stone

Дата:

23 сентября 2005 г., 15:31:33

On Fri, Sep 23, 2005 at 03:49:25PM +0200, Joost Kraaijeveld wrote:
>On Fri, 2005-09-23 at 07:05 -0400, Michael Stone wrote:
>> Ok, that's great, but you didn't respond to the suggestion of using COPY
>> INTO instead of INSERT.
>Part of the code I left out are some data conversions (e.g. from
>path-to-file to blob, from text to date (not castable because of the
>homebrew original format)). I don't believe that I can do these in a SQL
>statement, can I (my knowledge of SQL as a langage is not that good)? .
>However I will investigate if I can do the conversion in two steps and
>check if it is faster.

I'm not sure what you're trying to say.

You're currently putting rows into the table by calling "INSERT INTO"
for each row. The sample code you send could be rewritten to use "COPY
INTO" instead.  For bulk inserts like you're doing, the copy approach
will be a lot faster.  Instead of inserting one row, waiting for a
reply, and inserting the next row, you just cram data down a pipe to the
server.

See:
http://www.postgresql.org/docs/8.0/interactive/sql-copy.html
http://www.faqs.org/docs/ppbook/x5504.htm

Mike Stone

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: How to determine cause of performance problem?