Discussion: COPY FROM STDIN instead of INSERT

COPY FROM STDIN instead of INSERT

From:
"Ilja Golshtein"
Date:
Hello!

One important use case in my libpq-based application (PostgreSQL 8.1.4) is a kind of massive data loading.

Currently it is implemented as a series of plain normal INSERTs
(using the binary form of PQexecParams), and the problem is that it is pretty slow.

I've tried to play with batches and with peculiar constructions
like INSERT (SELECT .. UNION ALL SELECT ..) to improve performance, but I am not satisfied with the results I've got.

Now I am trying to figure out whether it is possible to use COPY FROM STDIN instead of INSERT when I have to insert, say, more than 100 records at once.
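
What I have in mind is roughly the following (just a sketch: the connection string, the table mytable and its columns are placeholders, and error handling is stripped to a minimum):

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn   *conn = PQconnectdb("dbname=test");
    PGresult *res;
    int       i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Switch the connection into COPY IN mode. */
    res = PQexec(conn, "COPY mytable (id, name) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }
    PQclear(res);

    /* Send rows in the text COPY format: tab-separated columns, one row per line. */
    for (i = 0; i < 100; i++)
    {
        char line[64];
        int  len = snprintf(line, sizeof(line), "%d\trow_%d\n", i, i);

        if (PQputCopyData(conn, line, len) != 1)
        {
            fprintf(stderr, "PQputCopyData failed: %s", PQerrorMessage(conn));
            break;
        }
    }

    /* End the COPY and pick up the command result. */
    if (PQputCopyEnd(conn, NULL) == 1)
    {
        res = PQgetResult(conn);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "COPY did not complete: %s", PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}

The same PQputCopyData/PQputCopyEnd calls would apply to COPY BINARY as well; only the format of the data placed in the buffer changes.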

Hints are highly appreciated.

The only limitation mentioned in the manual is about Rules, and I don't care about that since I don't use Rules.
Am I going to run into any other problems (concurrency, reliability, compatibility, whatever) along the way?

Many thanks.

--
Best regards
Ilja Golshtein

Re: COPY FROM STDIN instead of INSERT

From:
Shane Ambler
Date:
Ilja Golshtein wrote:
> Hello!
>
> One important use case in my libpq-based application (PostgreSQL 8.1.4) is a kind of massive data loading.
>
> Currently it is implemented as a series of plain normal INSERTs
> (using the binary form of PQexecParams), and the problem is that it is pretty slow.
>
> I've tried to play with batches and with peculiar constructions
> like INSERT (SELECT .. UNION ALL SELECT ..) to improve performance, but I am not satisfied with the results I've got.
>
> Now I am trying to figure out whether it is possible to use COPY FROM STDIN instead of INSERT when I have to insert, say, more than 100 records at once.
>
> Hints are highly appreciated.
>
> The only limitation mentioned in the manual is about Rules, and I don't care about that since I don't use Rules.
> Am I going to run into any other problems (concurrency, reliability, compatibility, whatever) along the way?
>
> Many thanks.
>
Using COPY FROM STDIN is much faster than INSERTs (I am sure some out
there have test times to compare; I don't have any on hand).


Sounds like you're working with an existing database - if you are starting
from scratch (inserting data into an empty database), then there are
other things that can help too.


--

Shane Ambler
Postgres@007Marketing.com

Get Sheeky @ http://Sheeky.Biz

Re: COPY FROM STDIN instead of INSERT

From:
"Ilja Golshtein"
Date:
Hello!

>Using COPY FROM STDIN is much faster than INSERTs (I am sure some out
>there have test times to compare; I don't have any on hand).

Yes, I know it is much faster.

The question is about the possible price of this speed.

What if COPY, say, locks an index until the end of the transaction (it is just an example, of course)? Such things are
not so easy to discover at design or even test time.

>Sounds like you're working with an existing database - if you are starting
>from scratch (inserting data into an empty database), then there are
>other things that can help too.

I am working with an existing database, though I am interested in what "other things" you mean.

Many thanks.

--
Best regards
Ilja Golshtein

Re: COPY FROM STDIN instead of INSERT

From:
Shane Ambler
Date:
Ilja Golshtein wrote:

>> Sounds like you're working with an existing database - if you are starting
>> from scratch (inserting data into an empty database), then there are
>> other things that can help too.
>
> I am working with an existing database, though I am interested in what "other things" you mean.
>
Basically, when adding data to a table, some of the time spent inserting
is spent updating indexes.

When starting a database from scratch, it is much faster to import the
data and then create the indexes. The time to create an index on a full
table is less than the total extra time spent updating that index during
the inserts. The more indexes there are to update, the more time updating indexes takes.

The problem with a live database is that removing the indexes slows down
current users, and if you are adding 2,000 rows to a table that already
has 5,000,000 rows in it, then you will lose the benefit.
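
Roughly, once the bulk load into the empty table has finished, the index build boils down to something like this (a sketch only - the table and index names are made up):

#include <stdio.h>
#include <libpq-fe.h>

/* After the bulk COPY into the empty, index-less table has finished,
 * build the indexes in one pass over the full table and refresh the
 * planner statistics.  Table and index names are made up. */
static int create_indexes_after_load(PGconn *conn)
{
    static const char *stmts[] = {
        "CREATE INDEX big_table_id_idx ON big_table (id)",
        "CREATE INDEX big_table_name_idx ON big_table (name)",
        "ANALYZE big_table",
    };
    size_t i;

    for (i = 0; i < sizeof(stmts) / sizeof(stmts[0]); i++)
    {
        PGresult *res = PQexec(conn, stmts[i]);

        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            fprintf(stderr, "%s: %s", stmts[i], PQerrorMessage(conn));
            PQclear(res);
            return -1;
        }
        PQclear(res);
    }
    return 0;
}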


--

Shane Ambler
Postgres@007Marketing.com

Get Sheeky @ http://Sheeky.Biz

Re: COPY FROM STDIN instead of INSERT

From:
"Ilja Golshtein"
Date:
>When starting a database from scratch, it is much faster to import the
>data and then create the indexes. The time to create an index on a full
>table is less than the total extra time spent updating that index during
>the inserts. The more indexes there are to update, the more time updating indexes takes.
>
>The problem with a live database is that removing the indexes slows down
>current users, and if you are adding 2,000 rows to a table that already
>has 5,000,000 rows in it, then you will lose the benefit.

I agree with you 100%. What you are describing is a very good and useful technique for some maintenance operations.

My current goal is to increase performance in the normal [almost ;)] OLTP mode of my application, so removing indexes
for some time is not an option here.

And my question remains.
Is it Ok to use COPY BINARY FROM STDIN instead of multiple INSERTs?

--
Best regards
Ilja Golshtein

Re: COPY FROM STDIN instead of INSERT

From:
Martijn van Oosterhout
Date:
On Wed, Oct 18, 2006 at 04:20:41PM +0400, Ilja Golshtein wrote:
> And my question remains.
> Is it Ok to use COPY BINARY FROM STDIN instead of multiple INSERTs?

The reason copy is faster is that it doesn't have to
parse/plan/execute all the queries. In exchange, you can't use
expressions or joins to fill the table, only raw data.

Binary may be slightly faster because the datum parsing can be
partially skipped, but that's hardly much benefit over a text copy.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: COPY FROM STDIN instead of INSERT

From:
"Ilja Golshtein"
Date:
>The reason copy is faster is that it doesn't have to
>parse/plan/execute all the queries. In exchange, you can't use
>expressions or joins to fill the table, only raw data.

In other words, COPY has no hidden catches, and I should go with it and not worry.
Is that a correct interpretation? ;)

>Binary may be slightly faster because the datum parsing can be
>partially skipped, but that's hardly much benefit over a text copy.

I know of an example where it is up to three times faster.
It depends on the data.

--
Best regards
Ilja Golshtein

Re: COPY FROM STDIN instead of INSERT

From:
"Merlin Moncure"
Date:
On 10/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Binary may be slightly faster because the datum parsing can be
> partially skipped, but that's hardly much benefit over a text copy.

I tested binary quite a bit and only found it to be a win if moving
blobs in and out of the database.  On 'normal' tables of mixed field
types of small size, it can actually be slower.  Binary is a bit
faster for native types and bytea, and slower for character types.

merlin

Re: COPY FROM STDIN instead of INSERT

From:
"Merlin Moncure"
Date:
On 10/18/06, Ilja Golshtein <ilejn@yandex.ru> wrote:
> I've tried to play with batches and with peculiar constructions
> like INSERT (SELECT .. UNION ALL SELECT ..) to improve performance, but not satisfied with the result I've got.

postgresql 8.2 (beta) supports the 'multiple insert' syntax, so you
can insert multiple rows in an insert statement without using 'union
all'. it's pretty fast, although not as fast as copy.  btw, if you
have a lot of indexes on your table, the input method is not so
important.
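
e.g. a multi-row insert through libpq looks something like this (a sketch - the table t and the values are made up, and it needs an 8.2 server):

#include <stdio.h>
#include <libpq-fe.h>

/* One INSERT carrying several rows via the 8.2 multi-row VALUES
 * syntax.  The table t (id int, val text) is made up. */
static void insert_batch(PGconn *conn)
{
    PGresult *res = PQexec(conn,
        "INSERT INTO t (id, val) VALUES "
        "(1, 'one'), (2, 'two'), (3, 'three')");

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "multi-row insert failed: %s", PQerrorMessage(conn));
    PQclear(res);
}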

aside: new insert syntax has one very nice side effect.  assuming the
table(s) are already defined, I can transfer data  from mysql to
postgresql via:

mysqldump --compatible=postgresql esilo | grep INSERT | psql

without any extra processing for most cases.  gotta give some points
to mysql for adding postgresql compatibility which sorta works.

merlin

Re: COPY FROM STDIN instead of INSERT

From:
Shane Ambler
Date:
Ilja Golshtein wrote:

> And my question remains.
> Is it Ok to use COPY BINARY FROM STDIN instead of multiple INSERTs?
>
If it does what you want then it is OK to use it.

--

Shane Ambler
Postgres@007Marketing.com

Get Sheeky @ http://Sheeky.Biz

Re: COPY FROM STDIN instead of INSERT

From:
Ron Johnson
Date:
On 10/18/06 08:03, Merlin Moncure wrote:
> On 10/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
>> Binary may be slightly faster because the datum parsing can be
>> partially skipped, but that's hardly much benefit over a text copy.
>
> I tested binary quite a bit and only found it to be a win if moving
> blobs in and out of the database.  On 'normal' tables of mixed field
> types of small size, it can actually be slower.  Binary is a bit
> faster for native types and bytea, and slower for character types.

"native types"?

--
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.

Re: COPY FROM STDIN instead of INSERT

From:
"Merlin Moncure"
Date:
On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
> > I tested binary quite a bit and only found it to be a win if moving
> > blobs in and out of the database.  On 'normal' tables of mixed field
> > types of small size, it can actually be slower.  Binary is a bit
> > faster for native types and bytea, and slower for character types.
>
> "native types"?

types operated on directly by the processor. int2, int4, int8, float4,
and float8, and their various aliases :).

in short, i think using binary for anything other than bytea is a
waste of effort/time.

merlin

Re: COPY FROM STDIN instead of INSERT

From:
Ron Johnson
Date:
On 10/18/06 09:47, Merlin Moncure wrote:
> On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
>> > I tested binary quite a bit and only found it to be a win if moving
>> > blobs in and out of the database.  On 'normal' tables of mixed field
>> > types of small size, it can actually be slower.  Binary is a bit
>> > faster for native types and bytea, and slower for character types.
>>
>> "native types"?
>
> types operated on directly by the processor. int2, int4, int8, float4,
> and float8, and their various aliases :).
>
> in short, i think using binary for anything other than bytea is a
> waste of effort/time.

That's counter-intuitive, since you'd (well, I'd) think that doing a
binary copy would be faster since the code would bypass the int-to-
ascii conversion.

--
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.

Re: COPY FROM STDIN instead of INSERT

From:
Alvaro Herrera
Date:
Ron Johnson wrote:
> On 10/18/06 09:47, Merlin Moncure wrote:
> > On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
> >> > I tested binary quite a bit and only found it to be a win if moving
> >> > blobs in and out of the database.  On 'normal' tables of mixed field
> >> > types of small size, it can actually be slower.  Binary is a bit
> >> > faster for native types and bytea, and slower for character types.
> >>
> >> "native types"?
> >
> > types operated on directly by the processor. int2, int4, int8, float4,
> > and float8, and their various aliases :).
> >
> > in short, i think using binary for anything other than bytea is a
> > waste of effort/time.
>
> That's counter-intuitive, since you'd (well, I'd) think that doing a
> binary copy would be faster since the code would bypass the int-to-
> ascii conversion.

Yeah, but on the other hand it has to do the htonl/ntohl conversion.
(I'd guess that should be faster than the text-to-int anyway ...)
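
Roughly, for a single non-null int4 field the binary COPY stream carries a length word and then the value, both passed through htonl - a sketch, not the actual server code:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* How a single non-null int4 field is laid out inside a binary COPY
 * tuple: a 4-byte length followed by the value, both in network byte
 * order - this is where the htonl/ntohl work comes in. */
static size_t put_int4_field(char *buf, int32_t value)
{
    uint32_t len = htonl(4);                 /* field length in bytes */
    uint32_t val = htonl((uint32_t) value);  /* value, big-endian     */

    memcpy(buf, &len, 4);
    memcpy(buf + 4, &val, 4);
    return 8;                                /* bytes written         */
}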

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: COPY FROM STDIN instead of INSERT

From:
"Merlin Moncure"
Date:
On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
> On 10/18/06 09:47, Merlin Moncure wrote:
> > On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
> >> > I tested binary quite a bit and only found it to be a win if moving
> >> > blobs in and out of the database.  On 'normal' tables of mixed field
> >> > types of small size, it can actually be slower.  Binary is a bit
> >> > faster for native types and bytea, and slower for character types.
> >>
> >> "native types"?
> >
> > types operated on directly by the processor. int2, int4, int8, float4,
> > and float8, and their various aliases :).
> >
> > in short, i think using binary for anything other than bytea is a
> > waste of effort/time.
>
> That's counter-intuitive, since you'd (well, I'd) think that doing a
> binary copy would be faster since the code would bypass the int-to-
> ascii conversion.

you missed the point: binary copy is (very marginally) faster for
'native types' aka ints, etc.  however, text fields are slower
according to my testing, and the speed differences are extremely
marginal overall.  the only exception to this is binary (bytea)
fields... you easily get a 2x speed improvement due to skipping the
encoding.

merlin

Re: COPY FROM STDIN instead of INSERT

From:
Casey Duncan
Date:
On Oct 18, 2006, at 5:20 AM, Ilja Golshtein wrote:

>> When starting a database from scratch, it is much faster to import the
>> data and then create the indexes. The time to create an index on a full
>> table is less than the total extra time spent updating that index during
>> the inserts. The more indexes there are to update, the more time updating
>> indexes takes.
>>
>> The problem with a live database is that removing the indexes slows down
>> current users, and if you are adding 2,000 rows to a table that already
>> has 5,000,000 rows in it, then you will lose the benefit.
>
> I agree with you 100%. What you are describing is a very good
> and useful technique for some maintenance operations.
>
> My current goal is to increase performance in the normal [almost ;)]
> OLTP mode of my application, so removing indexes for some time is
> not an option here.
>
> And my question remains.
> Is it Ok to use COPY BINARY FROM STDIN instead of multiple INSERTs?

I don't think I would use BINARY; it seems likely to be susceptible
to changes in the underlying data type storage. From the docs:

"To determine the appropriate binary format for the actual tuple data
you should consult the PostgreSQL source, in particular the *send and
*recv functions for each column's data type (typically these
functions are found in the src/backend/utils/adt/ directory of the
source distribution)."

Regular text COPY is quite a big performance win over INSERTs, and
doesn't require your application to know the vagaries of the data
storage. Also, if you have many indices, time to update them will
probably dominate anyhow, making the difference between binary and
text copy negligible.

As was mentioned, COPY can only insert static data, and does not
support rules (which you might use to support constraint exclusion,
etc.). AFAIK, the locking semantics are the same as INSERT, i.e., it
does not lock the entire table or anything like that. Since it would
allow transactions that insert data to finish faster, it should
actually work better under high concurrency.

-Casey

Re: COPY FROM STDIN instead of INSERT

From:
"Ilja Golshtein"
Date:
>> Is it Ok to use COPY BINARY FROM STDIN instead of multiple INSERTs?

>I don't think I would use BINARY; it seems likely to be susceptible
>to changes in the underlying data type storage. From the docs:
>"To determine the appropriate binary format for the actual tuple data
>you should consult the PostgreSQL source

[skipped]

Yes, it's a problem, though it's already solved in my case. I use BINARY for, say, initial data loading.


>AFAIK, the locking semantics are the same as INSERT, i.e., it
>does not lock the entire table or anything like that. Since it would
>allow transactions that insert data to finish faster, it should
>actually work better under high concurrency.

I do hope that's right.

Thanks.

--
Best regards
Ilja Golshtein