Discussion: Issues with \copy from file

Issues with \copy from file

From:
Sigurgeir Gunnarsson
Date:
I'm doing \copy from a file into a table. There are two files, one with 7 million lines and the other with around 24 million, and the data goes into separate tables. There are only three columns in each file and four in each table (the fourth being the primary key, an id serial). The data is about 150 MB and 450 MB and takes from 5 to 20 minutes to load into the database.

What I'm wondering is which parameters to tweak to improve the operation and shorten the time of the \copy. I think I have tweaked most of the ones available in postgresql.conf, that is shared_buffers, temp_buffers, work_mem, maintenance_work_mem and max_fsm_pages. Maybe someone could point out the ones really related to \copy?

I would hope that there is some way for me to improve the operation. I used to do the operation on a MySQL server and simple time measurements give me a difference of a factor of 3 to 4, with MySQL being the faster one.

I would also be satisfied to know if this is an expected difference.

Regards, Sigurgeir

Re: Issues with \copy from file

From:
Euler Taveira de Oliveira
Date:
Sigurgeir Gunnarsson wrote:
> What I'm wondering is which parameters to tweak to improve the
> operation and shorten the time of the \copy. I think I have tweaked
> most of the ones available in postgresql.conf, that is shared_buffers,
> temp_buffers, work_mem, maintenance_work_mem and max_fsm_pages. Maybe
> someone could point out the ones really related to \copy?
>
You don't show us your table definitions, and you don't say what PostgreSQL
version you're using. Let's suppose archiving is disabled, you're bulk loading
table foo, and you're using version >= 8.3. Just do:

BEGIN;
TRUNCATE TABLE foo;
COPY foo FROM ...;
COMMIT;

PostgreSQL will skip WAL writes and just fsync() the table at the end of the
command.
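
If you are loading through psql's \copy rather than server-side COPY, the
same trick applies, since \copy just runs a COPY ... FROM STDIN inside
whatever transaction is currently open (a sketch; foo and the path are
placeholders):

BEGIN;
TRUNCATE TABLE foo;
\copy foo FROM '/path/to/datafile.txt'
COMMIT;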

Also, take a look at [1].

[1] http://www.postgresql.org/docs/current/interactive/populate.html


--
  Euler Taveira de Oliveira
  http://www.timbira.com/

Re: Issues with \copy from file

From:
Scott Marlowe
Date:
On Mon, Oct 12, 2009 at 4:05 PM, Sigurgeir Gunnarsson
<sgunnars@gmail.com> wrote:
> I'm doing \copy from a file into a table. There are two files, one with 7
> million lines and the other with around 24 million, and the data goes into
> separate tables. There are only three columns in each file and four in each
> table (the fourth being the primary key, an id serial). The data is about
> 150 MB and 450 MB and takes from 5 to 20 minutes to load into the database.

You can only write data then commit it so fast to one drive, and that
speed is usually somewhere in the megabyte per second range.  450+150
in 5 minutes is 120 Megs per second, that's pretty fast, but is likely
the max speed of a modern super fast 15k rpm drive.  If it's taking 20
minutes then it's 30 Megs per second which is still really good if
you're in the middle of a busy afternoon and the db has other things
to do.

The key here is monitoring your machine to see what you're maxing out.
If you're at 100% IO then CPU tricks and tuning aren't likely to help,
unless you can reduce the IO needed to do the same thing (things like
turning off fsync might help streamline some writes).

To really tell what the numbers bi / bo / wa (from vmstat) mean, you need
to run some artificial tests to see what your machine can do at
quiescence. If you can get 120 MB per second streamed, and 20 MB per
second random on 8k blocks, then 5 minutes is the top side of what you
can ever expect to get. If you can get 600 MB per second then you're not
there yet, and might need multiple threads to load data fast.

pg_restore supports the -j switch for this, but it only parallelizes
across separate tables, so you'd be limited to two jobs at once right now
since there are only two tables.

> What I'm wondering is which parameters to tweak to improve the
> operation and shorten the time of the \copy.

COPY has a minimum cost in time per megabyte that you can't get out of.
The trick is knowing when you've gotten there (or damned close) so you
can quit banging your head against the wall about it.
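
A quick way to see where you actually are is to time the load itself with
psql's \timing and do the division by hand (a sketch; foo and the file
path are stand-ins):

\timing on
BEGIN;
TRUNCATE TABLE foo;
\copy foo FROM '/path/to/datafile.txt'
COMMIT;
-- divide the file size by the "Time:" psql reports for the \copy step
-- to get your real MB/s, then compare against what the drives can stream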


> I think I have tweaked most of
> the ones available in postgresql.conf, that is shared_buffers, temp_buffers,
> work_mem, maintenance_work_mem and max_fsm_pages. Maybe someone could point
> out the ones really related to \copy?

Try cranking up your checkpoint_segments to several hundred. Note that
this may delay restart after a crash. If you crash a lot you have other
problems, but still, be aware that if someone trips over a cord in the
afternoon you may have to wait 10, 20 or 30 minutes for the machine to
come back up while it replays the log files.
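
For instance, something like this in postgresql.conf, as a starting point
rather than gospel; tune it to how long a crash recovery you can live with:

checkpoint_segments = 256               # several hundred 16MB WAL segments
checkpoint_completion_target = 0.9      # spread checkpoint I/O out
checkpoint_timeout = 15min              # fewer forced checkpoints mid-load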


> I would hope that there is some way for me to improve the operation. I used
> to do the operation on a MySQL server and simple time measurements give me
> a difference of a factor of 3 to 4, with MySQL being the faster one.

With InnoDB tables?  If it's MyISAM tables it doesn't really count,
unless your data is unimportant, in which case MyISAM may be the better
choice.

> I would also be satisfied to know if this is an expected difference.

I'm not entirely sure there's really a difference.  I could believe one
if I saw it on my own hardware, where I run both DBs.  Pgsql is much
faster than MySQL on my machines for this type of stuff.

Note that reading the file from the same file system that you're
writing to is gonna be slow.  It'd likely be fastest to read from one
drive that is FAST but not the main storage drive.

Next, are pg_xlog files on the same partition as the main db?

I can copy files into my big servers in the 350 to 450
megabytes/second range if the machines are otherwise quiet (early am)
and sustain 150 to 200 even during moderately high loads during the
day.

Re: Issues with \copy from file

From:
Matthew Wakeling
Date:
On Sun, 18 Oct 2009, Scott Marlowe wrote:
> You can only write data then commit it so fast to one drive, and that
> speed is usually somewhere in the megabyte per second range.  450+150
> in 5 minutes is 120 Megs per second, that's pretty fast, but is likely
> the max speed of a modern super fast 15k rpm drive.  If it's taking 20
> minutes then it's 30 Megs per second which is still really good if
> you're in the middle of a busy afternoon and the db has other things
> to do.

You're out by a factor of 60 there. That's minutes, not seconds: 600 MB in
five minutes is 2 MB/s, not 120 MB/s.

More relevant is the fact that Postgres will normally log changes in the
WAL, effectively writing the data twice. As Euler said, the trick is to
tell Postgres that no one else will need to see the data, so it can skip
the WAL step:

> BEGIN;
> TRUNCATE TABLE foo;
> COPY foo FROM ...;
> COMMIT;

I see upward of 100MB/s over here when I do this.

Matthew

--
 Patron: "I am looking for a globe of the earth."
 Librarian: "We have a table-top model over here."
 Patron: "No, that's not good enough. Don't you have a life-size?"
 Librarian: (pause) "Yes, but it's in use right now."

Re: Issues with \copy from file

From:
Sigurgeir Gunnarsson
Date:
I hope the issue is still open, even though I haven't replied to it until now.

Euler mentioned that I did not provide any details about my system. I'm using version 8.3 with most settings at their defaults, on an old machine with 2 GB of memory. The table definition is simple: four columns, id, value, x and y, where id is the primary key and x, y are combined in an index.
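
In SQL terms the definition is roughly this (the names points and points_x_y_idx are just illustrative, and the column types are from memory, so take it as a sketch):

CREATE TABLE points (
    id    serial PRIMARY KEY,
    value integer,   -- actual types may differ
    x     integer,
    y     integer
);
CREATE INDEX points_x_y_idx ON points (x, y);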

I'm not sure if it matters, but unlike Euler's suggestion I'm using \copy instead of COPY. Regarding my comparison to MySQL, it is completely valid: it is done on the same computer, using the same disk on the same platform. From that I would conclude that IO is not my problem, unless PostgreSQL is doing the IO twice while MySQL does it only once.

I guess my tables are InnoDB, since that is the default type (or so I think). Wrapping the load in BEGIN/COMMIT did not change much for me. Are there any other suggestions?

My postgresql.conf:
#------------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#------------------------------------------------------------------------------

# - Memory -
shared_buffers = 16MB                   # min 128kB or max_connections*16kB
temp_buffers = 16MB                     # min 800kB
#max_prepared_transactions = 5          # can be 0 or more
# Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 128MB                        # min 64kB
maintenance_work_mem = 128MB            # min 1MB
#max_stack_depth = 2MB                  # min 100kB

# - Free Space Map -
max_fsm_pages = 2097152                 # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 500                 # min 100, ~70 bytes each

# - Kernel Resource Usage -
#max_files_per_process = 1000           # min 25
#shared_preload_libraries = ''          # (change requires restart)
# - Cost-Based Vacuum Delay -
#vacuum_cost_delay = 0                  # 0-1000 milliseconds
#vacuum_cost_page_hit = 1               # 0-10000 credits
#vacuum_cost_page_miss = 10             # 0-10000 credits
#vacuum_cost_page_dirty = 20            # 0-10000 credits
#vacuum_cost_limit = 200                # 1-10000 credits

#------------------------------------------------------------------------------
# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -
#fsync = on                             # turns forced synchronization on or off
#synchronous_commit = on                # immediate fsync at commit
#wal_sync_method = fsync                # the default is the first option
#full_page_writes = on                  # recover from partial page writes
#wal_buffers = 64kB                     # min 32kB
#wal_writer_delay = 200ms               # 1-10000 milliseconds
#commit_delay = 0                       # range 0-100000, in microseconds
#commit_siblings = 5                    # range 1-1000

# - Checkpoints -
checkpoint_segments = 64                # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min              # range 30s-1h
#checkpoint_completion_target = 0.9     # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s               # 0 is off

# - Archiving -
#archive_mode = off             # allows archiving to be done
#archive_command = ''           # command to use to archive a logfile segment
#archive_timeout = 0            # force a logfile segment switch after this

#------------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#------------------------------------------------------------------------------

autovacuum = on                         # Enable autovacuum subprocess?  'on'

Re: Issues with \copy from file

From:
Robert Haas
Date:
On Fri, Dec 18, 2009 at 7:46 AM, Sigurgeir Gunnarsson
<sgunnars@gmail.com> wrote:
> I hope the issue is still open, even though I haven't replied to it until
> now.
>
> Euler mentioned that I did not provide any details about my system. I'm
> using version 8.3 with most settings at their defaults, on an old machine
> with 2 GB of memory. The table definition is simple: four columns, id,
> value, x and y, where id is the primary key and x, y are combined in an
> index.
>
> I'm not sure if it matters, but unlike Euler's suggestion I'm using \copy
> instead of COPY. Regarding my comparison to MySQL, it is completely valid:
> it is done on the same computer, using the same disk on the same platform.
> From that I would conclude that IO is not my problem, unless PostgreSQL is
> doing the IO twice while MySQL does it only once.
>
> I guess my tables are InnoDB, since that is the default type (or so I
> think). Wrapping the load in BEGIN/COMMIT did not change much for me. Are
> there any other suggestions?

Did you read Matthew Wakeling's reply?  Arranging to skip WAL will
help a lot here.  To do that, you need to either create or truncate
the table in the same transaction that does the COPY.
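
For example, the create-in-the-same-transaction variant looks like this
(a sketch; foo and its columns are stand-ins, and archiving must be off
for the WAL skip to apply):

BEGIN;
CREATE TABLE foo (value integer, x integer, y integer);
COPY foo FROM '/path/to/datafile.txt';
COMMIT;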

The problem with the MySQL comparison is that it's not really
relevant.  It isn't that the PostgreSQL code just sucks and if we
wrote it properly it would be as fast as MySQL.  If that were the
case, everyone would be up in arms, and it would have been fixed long
ago.  Rather, the problem is almost certainly that it's not an
apples-to-apples comparison.  MySQL is probably doing something
different, such as perhaps not properly arranging for recovery if the
system goes down in the middle of the copy, or just after it
completes.  But I don't know MySQL well enough to know exactly what
the difference is, and I'm not particularly interested in spending a
lot of time figuring it out.  I think you'll get that reaction from
others on this list as well, but of course that's up to them:
everybody here is a volunteer, and generally our interest is
principally PostgreSQL.

On the other hand, we can certainly give you lots of information about
what PostgreSQL is doing and why that takes the amount of time that it
does, or give you information on how you can find out more about what
it's doing.

...Robert

Re: Issues with \copy from file

From:
Sigurgeir Gunnarsson
Date:
The intention was never to talk down PostgreSQL, but rather to try to get some explanation of the difference so that I could make the proper changes.

After having read the link from Euler's post, which I had overlooked, I have managed to shorten the import time. My problem was with the indexes. I was able to shorten the import time of a 26 million line import from 2+ hours (I gave up after that time) down to 12 minutes, by dropping the indexes after the truncate and before the copy.
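
In case it is useful to anyone else, the load now boils down to this pattern (names as in my sketch above, so treat them as illustrative; the column list matters because of the serial id, and the index is recreated afterwards):

BEGIN;
TRUNCATE TABLE points;
DROP INDEX points_x_y_idx;
\copy points (value, x, y) FROM '/path/to/datafile.txt'
CREATE INDEX points_x_y_idx ON points (x, y);
COMMIT;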

This is what I was expecting and I'm therefore satisfied with the result.

Regards, Sigurgeir

Re: Issues with \copy from file

From:
Robert Haas
Date:
On Fri, Dec 18, 2009 at 10:51 AM, Sigurgeir Gunnarsson
<sgunnars@gmail.com> wrote:
> The intention was never to talk down PostgreSQL, but rather to try to get
> some explanation of the difference so that I could make the proper changes.
>
> After having read the link from Euler's post, which I had overlooked, I
> have managed to shorten the import time. My problem was with the indexes.
> I was able to shorten the import time of a 26 million line import from 2+
> hours (I gave up after that time) down to 12 minutes, by dropping the
> indexes after the truncate and before the copy.
>
> This is what I was expecting and I'm therefore satisfied with the result.

Ah ha!  Well, it sounds like perhaps you have the answer to what was
causing the difference too, then.  I'm not trying to be unhelpful,
just trying to explain honestly why you might not get exactly the
response you expect to MySQL comparisons - we only understand half of
it.

...Robert