Importing Large Amounts of Data

From: Curt Sampson
Subject: Importing Large Amounts of Data
Date:
Msg-id: Pine.NEB.4.43.0204111721120.11530-100000@angelic.cynic.net
List: pgsql-general
I've been asked by a client to do some testing of Postgres for what
appears to be OLAP on a fairly large data set (about half a billion
tuples). I'm probably going to want to try partitioning this in various
ways, but the application, not Postgres, will deal with that.

I'm using PostgreSQL 7.2.1, and the schema I'm testing with is as follows:

    CREATE TABLE bigone (
        rec_no    int PRIMARY KEY,
        day       date,
        user_id   char(5),
        value     varchar(20)
    ) WITHOUT OIDS;
    CREATE INDEX bigone_day ON bigone (day);
    CREATE INDEX bigone_user_id ON bigone (user_id);
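(For reference, the load sequence being timed looks something like the
following; the file path is just an illustration, and the indexes are
recreated only after the data is in:)

    -- Bulk-load the table with COPY; /tmp/bigone.dat stands in for
    -- the real tab-delimited data file.
    COPY bigone FROM '/tmp/bigone.dat';

    -- Rebuild the indexes after the load finishes:
    CREATE UNIQUE INDEX bigone_pkey ON bigone (rec_no);
    CREATE INDEX bigone_day ON bigone (day);
    CREATE INDEX bigone_user_id ON bigone (user_id);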

Unfortunately, the first problem I've run into is that importing is
rather slow. With all indexes (including the bigone_pkey) dropped,
importing five million tuples into the above table, starting from empty,
takes about 921 seconds. The second 5M tuples takes about 1009 seconds.
If I switch to using the -F option, the first 5M takes 714 seconds and the
second 5M takes 742 seconds. At the end, I have about 742 MB of data under
the data/base directory. (This is using a fresh database cluster.)

For comparison, MySQL does the same two imports in about 221 and 304
seconds, and its data ends up taking about 427 MB.

Part of the problem here may be that Postgres appears to be logging the
COPY operation; I get from 27-33 "recycled transaction log file" messages
for every 5M tuple COPY that I do. If there were a way to do an unlogged
copy, that might save close to half a gig of writes to the disk.
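(One thing I could try is reducing the checkpoint churn with settings
like these in postgresql.conf; the values here are guesses to experiment
with, not measured recommendations:)

    # Fewer, larger checkpoints; 7.2 default is 3 segments.
    checkpoint_segments = 30
    # More WAL buffering before writes; 7.2 default is 8.
    wal_buffers = 16
    # Same effect as running with -F:
    fsync = false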

The other part of the problem may just be the size of the data;
why does Postgres take up 75% more space (about 78 bytes per tuple,
vs. 45 bytes per tuple) for this table?
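(The per-relation breakdown can be read out of pg_class; relpages counts
8 KB pages and is only accurate after a VACUUM, so something like this
after a fresh VACUUM should show where the space goes:)

    -- Update relpages, then report on-disk size of the table
    -- and its indexes in 8 KB pages and approximate KB.
    VACUUM ANALYZE bigone;
    SELECT relname, relpages, relpages * 8 AS approx_kb
    FROM pg_class
    WHERE relname LIKE 'bigone%';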

As well, index builds seem to take about 20% longer (using -F), and they
seem to be about 10% larger as well.

Does anybody have any suggestions as to how I can improve performance
here, and reduce disk space requirements? If not, I'll probably have
to suggest to the client that he move to MySQL for this particular
application, unless he needs any of the features that Postgres provides
and MySQL doesn't.

--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


In pgsql-general by date sent:

Previous
From: Alvaro Herrera
Date:
Message: Re: Multiply and Divide operators for geometry types
Next
From: "Papp, Gyozo"
Date:
Message: Re: SPI_execp() failed in RI_FKey_cascade_del()