Discussion: COPY FROM STDIN instead of INSERT
Hello!

One important use case in my libpq based application (PostgreSQL 8.1.4) is a sort of massive data loading. Currently it is implemented as a series of plain normal INSERTs (the binary form of PQexecParams is used) and the problem is that it is pretty slow.

I've tried to play with batches and with peculiar constructions like INSERT (SELECT .. UNION ALL SELECT ..) to improve performance, but I am not satisfied with the results I've got. Now I am trying to figure out whether it is possible to use COPY FROM STDIN instead of INSERT if I have to insert, say, more than 100 records at once. Hints are highly appreciated.

The only limitation mentioned in the Manual is about Rules, and I don't care about this since I don't use Rules. Am I going to run into any other problems (concurrency, reliability, compatibility, whatever) along the way?

Many thanks.

--
Best regards,
Ilja Golshtein
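For concreteness, here is a minimal sketch (not from the original post) of how rows could be serialized into COPY text format before being streamed to the server, e.g. with libpq's PQputCopyData() after issuing "COPY ... FROM STDIN". The table layout and values are hypothetical:

```python
# Sketch: turn rows into COPY text format -- tab-separated fields,
# one line per row, NULL spelled \N, and backslash/tab/newline escaped.

def copy_escape(value):
    """Escape one field for COPY text format; None becomes \\N."""
    if value is None:
        return r"\N"
    s = str(value)
    return (s.replace("\\", "\\\\")   # escape backslashes first
             .replace("\t", "\\t")
             .replace("\n", "\\n")
             .replace("\r", "\\r"))

def copy_lines(rows):
    """Yield one COPY-format line (with trailing newline) per row."""
    for row in rows:
        yield "\t".join(copy_escape(v) for v in row) + "\n"

lines = list(copy_lines([(1, "first"), (2, None), (3, "tab\there")]))
```

Each generated line is what a single PQputCopyData() call would carry; the buffering granularity (one row per call vs. larger chunks) is up to the application.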
Ilja Golshtein wrote:
> Hello!
>
> One important use case in my libpq based application (PostgreSQL 8.1.4) is a sort of massive data loading.
>
> Currently it is implemented as a series of plain normal INSERTs
> (the binary form of PQexecParams is used) and the problem is that it is pretty slow.
>
> I've tried to play with batches and with peculiar constructions
> like INSERT (SELECT .. UNION ALL SELECT ..) to improve performance, but I am not satisfied with the results I've got.
>
> Now I am trying to figure out whether it is possible to use COPY FROM STDIN instead of INSERT if I have to insert, say, more than 100 records at once.
>
> Hints are highly appreciated.
>
> The only limitation mentioned in the Manual is about Rules, and I don't care about this since I don't use Rules.
> Am I going to run into any other problems (concurrency, reliability, compatibility, whatever) along the way?
>
> Many thanks.

Using COPY FROM STDIN is much faster than INSERTs (I am sure some out there have test times to compare; I don't have any on hand).

Sounds like you're working with an existing database. If you are starting from scratch (inserting data into an empty database) then there are other things that can help too.

--
Shane Ambler
Postgres@007Marketing.com

Get Sheeky @ http://Sheeky.Biz
Hello!

> Using COPY FROM STDIN is much faster than INSERTs (I am sure some out
> there have test times to compare; I don't have any on hand)

Yes, I know it is much faster. The question is what I might have to pay for that speed. What if COPY, say, locks an index until the end of the transaction (it is just an example, of course)? Such things are not so easy to discover at design or even test time.

> Sounds like you're working with an existing database. If you are starting
> from scratch (inserting data into an empty database) then there are
> other things that can help too.

I am working with an existing database, though I am interested in what "other things" you mean.

Many thanks.

--
Best regards,
Ilja Golshtein
Ilja Golshtein wrote:
>> Sounds like you're working with an existing database. If you are starting
>> from scratch (inserting data into an empty database) then there are
>> other things that can help too.
>
> I am working with an existing database, though I am interested in what "other things" you mean.

Basically, when adding data to a table, some of the time spent inserting is spent updating indexes. When starting a database from scratch it is much faster to import the data and then create the indexes: the time to create an index on a full table is less than the total extra time the inserts spend on index updates. The more indexes there are to update, the more time updating them takes.

The problem with a live database is that removing the indexes slows down current users, and if you are adding 2,000 rows to a table that already has 5,000,000 rows in it then you will lose the benefit.

--
Shane Ambler
Postgres@007Marketing.com

Get Sheeky @ http://Sheeky.Biz
> When starting a database from scratch it is much faster to import the
> data and then create the indexes. The time to create an index on a full
> table is less than the extra time from each index update from the
> inserts. The more indexes to update, the more time updating indexes takes.
>
> The problem with a live database is that removing the indexes slows down
> current users, and if you are adding 2,000 rows to a table that already
> has 5,000,000 rows in it then you will lose the benefit.

I agree with you 100%. What you are describing is a very good and useful technique for some maintenance operations.

My current goal is to increase performance in the normal [almost ;)] OLTP mode of my application, so removing indexes for some time is not an option here.

And my question remains: is it OK to use COPY BINARY FROM STDIN instead of multiple INSERTs?

--
Best regards,
Ilja Golshtein
On Wed, Oct 18, 2006 at 04:20:41PM +0400, Ilja Golshtein wrote:
> And my question remains.
> Is it OK to use COPY BINARY FROM STDIN instead of multiple INSERTs?

The reason copy is faster is that it doesn't have to parse/plan/execute all the queries. In exchange you can't use expressions or joins to fill the table, only raw data.

Binary may be slightly faster because the datum parsing can be partially skipped, but that's hardly much benefit over a text copy.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
> The reason copy is faster is that it doesn't have to
> parse/plan/execute all the queries. In exchange you can't use
> expressions or joins to fill the table, only raw data.

In other words, COPY has no hidden catches, and I should go with it and not worry. Is that a correct interpretation? ;)

> Binary may be slightly faster because the datum parsing can be
> partially skipped, but that's hardly much benefit over a text copy.

I know an example where it is up to three times faster. It depends on the data.

--
Best regards,
Ilja Golshtein
On 10/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Binary may be slightly faster because the datum parsing can be
> partially skipped, but that's hardly much benefit over a text copy.

I tested binary quite a bit and only found it to be a win when moving blobs in and out of the database. On 'normal' tables of mixed field types of small size, it can actually be slower. Binary is a bit faster for native types and bytea, and slower for character types.

merlin
On 10/18/06, Ilja Golshtein <ilejn@yandex.ru> wrote:
> I've tried to play with batches and with peculiar constructions
> like INSERT (SELECT .. UNION ALL SELECT ..) to improve performance, but not satisfied with the result I've got.

postgresql 8.2 (beta) supports the 'multiple insert' syntax, so you can insert multiple rows in one insert statement without using 'union all'. it's pretty fast, although not as fast as copy. btw, if you have a lot of indexes on your table, the input method is not so important.

aside: the new insert syntax has one very nice side effect. assuming the table(s) are already defined, I can transfer data from mysql to postgresql via:

mysqldump --compatible=postgresql esilo | grep INSERT | psql

without any extra processing for most cases. gotta give some points to mysql for adding postgresql compatibility which sorta works.

merlin
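As a sketch of how the 8.2 multi-row VALUES form could replace the UNION ALL trick when paired with PQexecParams-style placeholders, here is a small statement builder (the table and column names are hypothetical; the actual values would be passed separately as parameters):

```python
# Sketch: build a single multi-row INSERT with numbered placeholders,
# e.g. INSERT INTO t (a, b) VALUES ($1, $2), ($3, $4), ...

def multi_insert_sql(table, columns, nrows):
    """Return a multi-row INSERT statement with $N placeholders."""
    ncols = len(columns)
    groups = []
    for r in range(nrows):
        # placeholders for this row: $1..$ncols, then $ncols+1.., etc.
        params = ", ".join(f"${r * ncols + c + 1}" for c in range(ncols))
        groups.append(f"({params})")
    return (f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES {', '.join(groups)}")

sql = multi_insert_sql("items", ["id", "name"], 3)
```

One statement then carries all the rows, so parse/plan overhead is paid once per batch instead of once per row.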
Ilja Golshtein wrote: > And my question remains. > Is it Ok to use COPY BINARY FROM STDIN instead of multiple INSERTs? > If it does what you want then it is OK to use it. -- Shane Ambler Postgres@007Marketing.com Get Sheeky @ http://Sheeky.Biz
On 10/18/06 08:03, Merlin Moncure wrote:
> On 10/18/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
>> Binary may be slightly faster because the datum parsing can be
>> partially skipped, but that's hardly much benefit over a text copy.
>
> I tested binary quite a bit and only found it to be a win if moving
> blobs in and out of the database. On 'normal' tables of mixed field
> types of small size, it can actually be slower. Binary is a bit
> faster for native types and bytea, and slower for character types.

"native types"?

--
Ron Johnson, Jr.
Jefferson LA USA

Is "common sense" really valid? For example, it is "common sense" to white-power racists that whites are superior to blacks, and that those with brown skins are mud people. However, that "common sense" is obviously wrong.
On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
> > I tested binary quite a bit and only found it to be a win if moving
> > blobs in and out of the database. On 'normal' tables of mixed field
> > types of small size, it can actually be slower. Binary is a bit
> > faster for native types and bytea, and slower for character types.
>
> "native types"?

types operated on directly by the processor: int2, int4, int8, float4, and float8, and their various aliases :).

in short, i think using binary for anything other than bytea is a waste of effort/time.

merlin
On 10/18/06 09:47, Merlin Moncure wrote:
> On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
>> > I tested binary quite a bit and only found it to be a win if moving
>> > blobs in and out of the database. On 'normal' tables of mixed field
>> > types of small size, it can actually be slower. Binary is a bit
>> > faster for native types and bytea, and slower for character types.
>>
>> "native types"?
>
> types operated on directly by the processor: int2, int4, int8, float4,
> and float8, and their various aliases :).
>
> in short, i think using binary for anything other than bytea is a
> waste of effort/time.

That's counter-intuitive, since you'd (well, I'd) think that doing a binary copy would be faster since the code would bypass the int-to-ascii conversion.

--
Ron Johnson, Jr.
Jefferson LA USA
Ron Johnson wrote:
> On 10/18/06 09:47, Merlin Moncure wrote:
> > types operated on directly by the processor: int2, int4, int8, float4,
> > and float8, and their various aliases :).
> >
> > in short, i think using binary for anything other than bytea is a
> > waste of effort/time.
>
> That's counter-intuitive, since you'd (well, I'd) think that doing a
> binary copy would be faster since the code would bypass the int-to-ascii
> conversion.

Yeah, but on the other hand it has to do the htonl/ntohl conversion. (I'd guess that should be faster than the text-to-int anyway ...)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
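To illustrate what the htonl/ntohl point means in practice, here is a sketch of the documented PGCOPY binary stream layout: every integer in it is big-endian (network order), which is what htonl produces and what struct's "!" prefix gives us. The single-int4 row below is purely illustrative:

```python
import struct

# The documented COPY BINARY framing: an 11-byte signature, two int32
# header fields, then per-tuple data, then an int16 trailer of -1.
PGCOPY_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"            # 11-byte magic
header = PGCOPY_SIGNATURE + struct.pack("!ii", 0, 0)   # flags, extension len

def int4_row(value):
    """One tuple: field count (int16), field length (int32), int4 payload,
    all big-endian -- the htonl'd layout the server expects."""
    return struct.pack("!hii", 1, 4, value)

row = int4_row(42)
trailer = struct.pack("!h", -1)                        # end-of-data marker
```

The per-field length prefix (and -1 for NULL) is what lets the server skip text parsing entirely; the cost that remains is exactly this byte-order conversion.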
On 10/18/06, Ron Johnson <ron.l.johnson@cox.net> wrote:
> > in short, i think using binary for anything other than bytea is a
> > waste of effort/time.
>
> That's counter-intuitive, since you'd (well, I'd) think that doing a
> binary copy would be faster since the code would bypass the int-to-ascii
> conversion.

you missed the point: binary copy is (very marginally) faster for 'native types', aka ints, etc. however, text fields are slower according to my testing. the speed differences are extremely marginal overall. the only exception to this is binary (bytea) fields...you get a 2x speed improvement easily due to skipping the encoding.

merlin
On Oct 18, 2006, at 5:20 AM, Ilja Golshtein wrote:
>> When starting a database from scratch it is much faster to import the
>> data and then create the indexes. The time to create an index on a full
>> table is less than the extra time from each index update from the
>> inserts. The more indexes to update, the more time updating indexes
>> takes.
>>
>> The problem with a live database is that removing the indexes slows down
>> current users, and if you are adding 2,000 rows to a table that
>> already has 5,000,000 rows in it then you will lose the benefit.
>
> I agree with you 100%. What you are describing is a very good
> and useful technique for some maintenance operations.
>
> My current goal is to increase performance in the normal [almost ;)]
> OLTP mode of my application, so removing indexes for some time is
> not an option here.
>
> And my question remains.
> Is it OK to use COPY BINARY FROM STDIN instead of multiple INSERTs?

I don't think I would use BINARY; it seems likely to be susceptible to changes in the underlying data type storage. From the docs:

"To determine the appropriate binary format for the actual tuple data you should consult the PostgreSQL source, in particular the *send and *recv functions for each column's data type (typically these functions are found in the src/backend/utils/adt/ directory of the source distribution)."

Regular text COPY is quite a big performance win over INSERTs, and doesn't require your application to know the vagaries of the data storage. Also, if you have many indices, the time to update them will probably dominate anyhow, making the difference between binary and text copy negligible.

As was mentioned, COPY can only insert static data, and does not support rules (which you might use to support constraint exclusion, etc). AFAIK, the locking semantics are the same as INSERT, i.e., it does not lock the entire table or anything like that.
Since it would allow transactions that insert data to finish faster, it should actually work better under high concurrency. -Casey
>> Is it OK to use COPY BINARY FROM STDIN instead of multiple INSERTs?
>
> I don't think I would use BINARY; it seems likely to be susceptible
> to changes in the underlying data type storage. From the docs:
> "To determine the appropriate binary format for the actual tuple data
> you should consult the PostgreSQL source [skipped]

Yes, that is a problem, though it is already solved in my case. I use BINARY for, say, initial data loading.

> AFAIK, the locking semantics are the same as INSERT, i.e., it
> does not lock the entire table or anything like that. Since it would
> allow transactions that insert data to finish faster, it should
> actually work better under high concurrency.

I do hope that's right.

Thanks.

--
Best regards,
Ilja Golshtein