Discussion: Inserting 26 million rows takes 8 hours, how to improve those times?


Inserting 26 million rows takes 8 hours, how to improve those times?

From
Jose Vicente Nunez Z
Date:
Greetings,

I'm trying to use PostgreSQL to manage large amounts of data; one of the
first things I'm testing is how fast PostgreSQL can load some big CSV
text files.

For that I'm using the PostgreSQL COPY command, but the problem is that it
is taking almost 9 hours to load the data:

copy nb_cmo_deal_pools from '/postgres-system/datafile.txt' DELIMITERS '|';

[root@linux0105 root]# time psql -Upostgres MYDB < load.sql

real    487m47.632s
user    0m0.020s
sys     0m0.000s
[root@linux0105 root]#

[root@linux0105 root]# cat /postgres-system/datafile.txt | wc -l
26026965

I've already played with the filesystem options on my ext3 system, and I would like to know:

1) How can I tune PostgreSQL to improve the insertion speed, besides dropping indexes and turning fsync off? (The drop-index / COPY / rebuild workflow I have in mind is sketched below.)
2) How 'safe' is it to use the fsync=off option on an ext3 journaled system? The journal should give me some protection in case the system goes down, shouldn't it?
3) I've read several web pages that talk about tuning the sort and buffer options in PostgreSQL, but found no good examples of such optimizations. Does anyone know where I can find more help? (My test system has 2 GB of RAM, and I think 1.5 GB for the database should be fair enough.)
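
For reference, a minimal sketch of that workflow (only the table name and the COPY command are taken from my actual test; the index name and its columns are placeholders):

-- drop the index before the bulk load (index definition here is hypothetical)
DROP INDEX some_index;
-- load everything in one COPY pass
copy nb_cmo_deal_pools from '/postgres-system/datafile.txt' DELIMITERS '|';
-- rebuild the index and refresh planner statistics afterwards
CREATE INDEX some_index ON nb_cmo_deal_pools (a, b, c);
ANALYZE nb_cmo_deal_pools;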

I apologize if this is not the proper place to post these questions.

Thanks in advance,


--
Jose Vicente Nunez Zuleta (josevnz at newbreak dot com)
Newbreak LLC System Administrator
http://www.newbreak.com
RHCE, SCJD, SCJP


Re: Inserting 26 million rows takes 8 hours, how to improve those times?

From
Tom Lane
Date:
Jose Vicente Nunez Z <josevnz@newbreak.com> writes:
> For that I'm using the PostgreSQL COPY command, but the problem is that it
> is taking almost 9 hours to load the data:

Hmm, 889 rows/second doesn't sound that bad, considering that you've
given us no data about the hardware you're running on.  In particular,
what's your disk setup?
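
(That is 26,026,965 rows in roughly 29,268 seconds of wall-clock time, or about 889 rows per second.)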

But at any rate, the only bottlenecks I could think of for a COPY
command are:

1. Data conversion.  This would depend on the column datatypes of the
table, which you didn't tell us.

2. Index updating.  This would depend on the indexes on the table,
which you didn't tell us.

3. Foreign-key constraint checking.  This would depend on the foreign
keys the table has, which you didn't tell us.

In short ... you'd better provide a lot more detail if you want
useful commentary.
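
For what it's worth, a minimal sketch of pulling that information out of the catalogs (assuming PostgreSQL 7.3 or later; 'mytable' is a placeholder for your table name):

-- list the indexes defined on the table
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'mytable';
-- column datatypes and indexes are also summarized by psql's \d mytable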

> I apologize if this is not the proper place to post these questions.

pgsql-performance would be more appropriate, probably.

            regards, tom lane

Re: Inserting 26 million rows takes 8 hours, how to improve those times?

From
Jose Vicente Nunez Z
Date:
Tom,


On Tue, 2003-09-16 at 10:44, Tom Lane wrote:
> Jose Vicente Nunez Z <josevnz@newbreak.com> writes:
> > For that I'm using the PostgreSQL COPY command, but the problem is that it
> > is taking almost 9 hours to load the data:
>
> Hmm, 889 rows/second doesn't sound that bad, considering that you've
> given us no data about the hardware you're running on.  In particular,
> what's your disk setup?

My fault, I should have provided more information about my setup.

I have two ATA disks (one for PostgreSQL, the other for the data files).
The box is a dual-CPU Pentium III at 1.3 GHz with 2 GB of RAM. One disk is
30 GB, the other is 40 GB.

As you can see it is not state-of-the-art hardware, but for development
purposes I should get decent performance (only one or two applications
will be accessing the server to do some data mining).

>
> But at any rate, the only bottlenecks I could think of for a COPY
> command are:
>
> 1. Data conversion.  This would depend on the column datatypes of the
> table, which you didn't tell us.
>
create table mytable
(
     a         varchar(20) not null,
     b         varchar(20) not null,
     c         varchar(20) not null,
     d         char(6),
     f         int null,
     g         float not null,
     h         float not null
)

Data conversions shouldn't be that complicated, should they?

> 2. Index updating.  This would depend on the indexes on the table,
> which you didn't tell us.

I had one index and dropped it for the load. I also turned off the 'fsync'
flag (fsync=false), and now the load time is only 15 minutes! But then
recreating the original index takes forever:

create index myindex on mytable ( a, b, c );

For some reason I don't see much I/O but a lot of CPU usage:

 11:18:05  up 3 days, 20:55,  3 users,  load average: 1.00, 1.00, 0.92
51 processes: 49 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states:  96.0% user   3.0% system   0.0% nice   0.0% iowait   0.0% idle
CPU1 states:   0.1% user   0.0% system   0.0% nice   0.0% iowait  99.0% idle
Mem:  2064404k av, 2049312k used,   15092k free,       0k shrd,  149544k buff
                   406064k actv, 1076876k in_d,  345612k in_c
Swap: 2047744k av,    4252k used, 2043492k free                 1809160k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 1882 postgres  25   0 41492  40M 36080 R    98.6  2.0  87:34   0 postmaster
16998 root      15   0     0    0     0 SW    1.9  0.0   4:30   0 kjournald
 1956 root      19   0  1152 1152   848 R     0.9  0.0   0:00   1 top
    1 root      15   0   476  448   420 S     0.0  0.0   0:07   1 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd_CPU0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd_CPU1
   11 root      19   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    7 root      15   0     0    0     0 SW    0.0  0.0   0:22   0 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kscand/DMA
    9 root      15   0     0    0     0 SW    0.0  0.0   1:57   0 kscand/Normal
   10 root      15   0     0    0     0 SW    0.0  0.0   2:28   0 kscand/HighMem
   12 root      15   0     0    0     0 SW    0.0  0.0   0:04   0 kupdated

Also, memory usage seems to be OK; I don't see any processes blocked or
waiting for resources, nor anything swapped out due to lack of memory:

[root@linux0105 data]# vmstat 10 10
   procs                      memory      swap          io    system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 1  0  0   4252  15092 149564 1809160    0    0    60    43    1    16  2  1 33
 1  0  0   4252  15092 149564 1809160    0    0     0  1820  134    12 49  1 50
 1  0  0   4252  15092 149564 1809160    0    0     0  1282  125     8 49  1 50
 1  0  0   4252  15092 149568 1809160    0    0     0  1202  124     9 50  1 50
 1  0  0   4252  15092 149572 1809160    0    0     0  1216  123     9 50  1 50
 1  0  0   4252  15092 149576 1809160    0    0     0  1224  125     8 50  0 50
 1  0  0   4252  15092 149580 1809160    0    0     0  1212  124    10 50  1 50
 1  0  0   4252  15092 149584 1809160    0    0     0  1206  123     8 50  1 50
 1  0  0   4252  15092 149588 1809160    0    0     0  1212  125     8 49  1 50
 1  0  0   4252  15092 149592 1809160    0    0     0  1208  125     9 49  1 50


Any ideas about what I can do to speed up the index creation? Why is it
using so much CPU (it doesn't seem to be writing much information to disk,
though)?
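
In case it matters, I'm rebuilding it like this; if I understand the docs correctly, sort_mem is the knob that governs the index-build sort in 7.x (the value below is only an illustration, and later releases call the equivalent setting maintenance_work_mem):

-- give this session's CREATE INDEX sort more memory
-- (sort_mem is in kB; 65536 kB = 64 MB, an illustrative value only)
SET sort_mem = 65536;
create index myindex on mytable ( a, b, c );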

>
> 3. Foreign-key constraint checking.  This would depend on the foreign
> keys the table has, which you didn't tell us.
>

None so far; it's just one table.

> In short ... you'd better provide a lot more detail if you want
> useful commentary.
>
> > I apologize if this is not the proper place to post these questions.
>
> pgsql-performance would be more appropriate, probably.

I'm reading some documentation (checking the list archives):

http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html

Hopefully this will help me fix the problem.

Thanks for the help,

JV.

>
>             regards, tom lane
--
Jose Vicente Nunez Zuleta (josevnz at newbreak dot com)
Newbreak LLC System Administrator
http://www.newbreak.com
RHCE, SCJD, SCJP


Re: Inserting 26 million rows takes 8 hours, how to improve those times?

From
Tom Lane
Date:
Jose Vicente Nunez Z <josevnz@newbreak.com> writes:
> create table mytable
> (
>      a         varchar(20) not null,
>      b         varchar(20) not null,
>      c         varchar(20) not null,
>      d         char(6),
>      f         int null,
>      g         float not null,
>      h         float not null
> )

> recreating the original index takes forever:

> create index myindex on mytable ( a, b, c );

> For some reason I don't see much I/O but a lot of CPU usage:

Hm.  I'll bet that you are using a non-C locale, such that varchar
comparisons depend on strcoll() instead of strcmp().  strcoll can
be incredibly slow in some locales.  Do you really need non-ASCII
sort order?  If not, I'll bet that re-initdb'ing in C locale will
make a difference.
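
A quick way to check (a sketch; on releases where the lc_* settings aren't visible via SHOW, pg_controldata run against the data directory reports them):

-- shows the collation locale the cluster was initdb'd with; 'C' or 'POSIX'
-- means cheap strcmp()-style comparisons
SHOW lc_collate;
-- switching means re-running initdb into a fresh data directory, e.g. with
-- the --locale=C option, and reloading the data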

            regards, tom lane

Re: Inserting 26 million rows takes 8 hours, how to improve those times?

From
"scott.marlowe"
Date:
On 16 Sep 2003, Jose Vicente Nunez Z wrote:

> Tom,
>
> [...]
>
> I had one index and dropped it for the load. I also turned off the 'fsync'
> flag (fsync=false), and now the load time is only 15 minutes! But then
> recreating the original index takes forever:
>
> create index myindex on mytable ( a, b, c );

Are you sure this is the type of index you want?  Keep in mind PostgreSQL
won't use this index for queries like:

select * from mytable where b='abc';

or

select * from mytable where a like '%abc%';

but only for queries that constrain the leading columns (a, then a and b,
then a, b, and c), and only with left-anchored text.

select * from mytable where a='abc' and b='def';

will work.  You might want to look at the tsearch engine, just updated for
7.4.x, for this kind of thing, or just run seq scans with a whole lot of
shared_buffers to hold your data.
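
If in doubt, EXPLAIN shows what the planner actually does with each form (the literal values are just placeholders; run ANALYZE on the table first):

-- expected to use myindex, since the leading column a is constrained
EXPLAIN SELECT * FROM mytable WHERE a = 'abc' AND b = 'def';
-- expected to fall back to a sequential scan: a is not constrained
EXPLAIN SELECT * FROM mytable WHERE b = 'abc';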

Cranking shared_buffers up to somewhere between 1000 and 20000 or so should
help.  Note that there's a point of diminishing returns for certain queries
with large shared_buffers, where buffer management costs more than having
the data in the buffer.  For what you're doing, however, larger may well be
better.

Oh, and don't forget to vacuum and analyze...
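
For concreteness, a minimal sketch of both suggestions (shared_buffers itself must be set in postgresql.conf and the postmaster restarted; 20000 buffers is roughly 160 MB at the default 8 kB page size, and is only an illustration):

-- check what the server is currently running with
SHOW shared_buffers;
-- e.g. in postgresql.conf:   shared_buffers = 20000
-- after the bulk load and index rebuild, refresh planner statistics
VACUUM ANALYZE mytable;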