Re: Better way to bulk-load millions of CSV records into postgres?

From: Marc Spitzer
Subject: Re: Better way to bulk-load millions of CSV records into postgres?
Date:
Msg-id: 20020522121819.A37228@oscar.eng.cv.net
In reply to: Re: Better way to bulk-load millions of CSV records into postgres?  (Tom Sheehan <tomas@csi.com>)
Responses: Re: Better way to bulk-load millions of CSV records into
List: pgsql-novice
On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> Have you looked at the COPY command in psql for this?  There are the FROM
> { 'filename' | stdin } and [ [USING] DELIMITERS ] clauses that may be of use.
> It seems to me that bulk loading is what they were designed for.
>
> ts

For very large datasets that can be a problem: it is one transaction, and
the transaction logs must be kept until it finishes or aborts.  This
can be a big disk hit.
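
For reference, the single-COPY load Tom is describing would look roughly
like this from the python script Ron mentions (table name, file path and
connect arguments are made up, and note that COPY reading a server-side
file normally needs superuser rights):

    from pyPgSQL import PgSQL

    conn = PgSQL.connect(database='mydb')   # connect args are a guess
    cur = conn.cursor()
    # one statement, one transaction: the whole file commits (or rolls
    # back) as a unit, which is where the transaction-log buildup comes in
    cur.execute("COPY mytable FROM '/path/to/data.csv' USING DELIMITERS ','")
    conn.commit()
    conn.close()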

If it is just a plain csv file you can use split to break the one file into
several smaller files and load each one separately.
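
If you would rather not shell out to split, the same chunking is a few
lines of python (file names and the 10,000-line chunk size are just
examples):

    # write big.csv out as chunk_0.csv, chunk_1.csv, ... with at most
    # 10,000 lines each
    def split_csv(path, lines_per_chunk=10000):
        src = open(path)
        chunk_no = 0
        count = 0
        out = open('chunk_%d.csv' % chunk_no, 'w')
        line = src.readline()
        while line:
            if count == lines_per_chunk:
                out.close()
                chunk_no = chunk_no + 1
                out = open('chunk_%d.csv' % chunk_no, 'w')
                count = 0
            out.write(line)
            count = count + 1
            line = src.readline()
        out.close()
        src.close()

    split_csv('big.csv')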

You can also look at transactions and do roughly the same thing from
python: commit every 10,000 rows.
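
A rough sketch of that with pyPgSQL (build_insert() below just stands in
for whatever the existing script already does to turn a csv line into an
INSERT statement; connect arguments are a guess):

    from pyPgSQL import PgSQL

    conn = PgSQL.connect(database='mydb')
    cur = conn.cursor()

    f = open('big.csv')
    count = 0
    line = f.readline()
    while line:
        cur.execute(build_insert(line))   # "INSERT INTO ... VALUES (...)"
        count = count + 1
        if count % 10000 == 0:
            conn.commit()                 # one transaction per 10,000 rows
        line = f.readline()
    conn.commit()                         # pick up the last partial batch
    f.close()
    conn.close()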

There is a variable in the config file that lets you turn off flushing
to disk.  If you do that for the load you will get better load
speed.  Turn it back on when you are done with the load.
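
The variable is fsync in postgresql.conf; something like

    fsync = false        # only while the bulk load runs

then set it back to true afterwards (and restart or signal the postmaster
so the change takes effect).  Keep in mind that a crash while fsync is
off can leave the database corrupt, so it is only safe for a load you
can redo from scratch.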

And read the admin guide twice; it will help.

marc



>
>
> ----- Original Message -----
> From: "Ron Johnson" <ron.l.johnson@cox.net>
> To: "PgSQL Novice ML" <pgsql-novice@postgresql.org>
> Sent: Tuesday, May 21, 2002 3:40 PM
> Subject: [NOVICE] Better way to bulk-load millions of CSV records into
> postgres?
>
>
> >
> > Hi,
> >
> > Currently, I've got a python script using pyPgSQL that
> > parses the CSV record, creates a string that is a big
> > "INSERT INTO VALUES (...)" command, then, execute() it.
> >
> > top shows that this method uses postmaster with ~70% CPU
> > utilization, and python with ~15% utilization.
> >
> > Still, it's only inserting ~190 recs/second.  Is there a
> > better way to do this, or am I constrained by the hardware?
> >
> > Instead of python and postmaster having to do a ton of data
> > xfer over sockets, I'm wondering if there's a way to send a
> > large number of csv records (4000, for example) in one big
> > chunk to a stored procedure and have the engine process it
> > all.
> >
> > Linux 2.4.18
> > PostgreSQL 7.2.1
> > python 2.1.3
> > csv file on /dev/hda
> > table on /dev/hde  (ATA/100)
> >
> > --
> > +---------------------------------------------------------+
> > | Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
> > | Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
> > |                                                         |
> > | "I have created a government of whirled peas..."        |
> > |   Maharishi Mahesh Yogi, 12-May-2002,                   |
> > !   CNN, Larry King Live                                  |
> > +---------------------------------------------------------+
> >
> >
>
