COPY vs INSERT

From
Steven Rosenstein
Date:



In our application we have tables that we regularly load with 5-10 million
records daily.  We *were* using INSERT (I know...  Still kicking ourselves
for *that* design decision), and we are now converting over to COPY.  For the
sake of robustness, we are planning on breaking the entire load into chunks
of a couple hundred thousand records each. This is to constrain the amount
of data we'd have to re-process if one of the COPYs fails.
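That chunked-load plan can be sketched roughly like this (the chunk size, function names, and the surrounding COPY machinery are illustrative, not from the thread):

```python
# Split an iterable of records into fixed-size chunks so that a failed
# COPY only forces re-processing of one chunk, not the whole daily load.
from itertools import islice

def chunked(records, chunk_size=200_000):
    """Yield lists of at most chunk_size records each."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Each chunk would then be fed to its own COPY; if one COPY fails,
# only that chunk needs to be retried.
```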

My question is, are there any advantages, drawbacks, or outright
restrictions to using multiple simultaneous COPY commands to load data into
the same table?   One issue that comes to mind is the loss of data
sequencing if we have multiple chunks interleaving records in the table at
the same time.  But from a purely technical point of view, is there any
reason why the backend would not be happy with two or more COPY commands
trying to insert data into the same table at the same time?  Does COPY take
out any locks on a table?

Thanks in advance,
--- Steve


Re: COPY vs INSERT

From
Tom Lane
Date:
Steven Rosenstein <srosenst@us.ibm.com> writes:
> My question is, are there any advantages, drawbacks, or outright
> restrictions to using multiple simultaneous COPY commands to load data into
> the same table?

It will work; not sure about whether there is any performance benefit.
I vaguely recall someone having posted about doing this, so you might
check the archives.

            regards, tom lane

Re: COPY vs INSERT

From
Mischa Sandberg
Date:
> Steven Rosenstein <srosenst@us.ibm.com> writes:
> > My question is, are there any advantages, drawbacks, or outright
> > restrictions to using multiple simultaneous COPY commands to load
> data into
> > the same table?

Do you mean, multiple COPY commands (connections) being putline'd from
the same thread (process)? I have indirect evidence that this may hurt.

Two copy commands from different threads/processes are fine, and can
help, if they alternate contention on some other resource (disk/CPU).

I'm basing this on being at the third generation of a COPY
implementation. The app loads about 1M objects/hour from 6 servers.
Each object is split across four tables.
The batch load opens four connections and firehoses records down each.
A batch is 10K objects.

COPY invokes all the same logic as INSERT on the server side
(rowexclusive locking, transaction log, updating indexes, rules).
The difference is that all the rows are inserted as a single
transaction. This reduces the number of fsync's on the xlog,
which may be a limiting factor for you. You'll want to crank
WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though.
One of my streams has 6K records; I run with WB=1000, CS=128.
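In postgresql.conf terms, those values would be (checkpoint_segments was the parameter name in PostgreSQL versions of this era):

```
# postgresql.conf -- values from the post above; tune to your workload
wal_buffers = 1000            # WB=1000
checkpoint_segments = 128     # CS=128
```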

The downside I found with multiple clients inserting large blocks of
rows was, that they serialized. I THINK that's because at some point
they all needed to lock the same portions of the same indexes. I'm still
working on how to avoid that, tuning the batch size and inserting into a
 "queue" table with fewer indexes.

COPY (via putline) didn't do measurably better than INSERT until I
batched 40 newline-separated rows into one putline call, which improved
it 2-3:1. The suspect problem was stalling on the TCP stream; the driver
was flushing small packets. This may or may not be relevant to you;
depends on how much processing (waiting) your app does between posting
of rows.
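The batching-before-putline idea can be sketched like so (`send` stands in for the driver's putline call and is an assumption here, not a real API):

```python
# Accumulate N newline-terminated COPY rows and push them in a single
# putline-style send, instead of flushing many small TCP packets.
def batched_putlines(rows, send, batch_size=40):
    """Call send() once per batch_size rows, joining them with newlines."""
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) == batch_size:
            send("\n".join(buf) + "\n")
            buf = []
    if buf:                      # flush the final partial batch
        send("\n".join(buf) + "\n")
```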

In such a case, writing alternately to two TCP streams from the same
process increases the likelihood of a stall. I've never tested that
set-up; it would have been heading AWAY from the solution in my case.

Hope that helps.
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.


Re: COPY vs INSERT

From
Mike Rylander
Date:
On 5/3/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Steven Rosenstein <srosenst@us.ibm.com> writes:
> > My question is, are there any advantages, drawbacks, or outright
> > restrictions to using multiple simultaneous COPY commands to load data into
> > the same table?
>
> It will work; not sure about whether there is any performance benefit.
> I vaguely recall someone having posted about doing this, so you might
> check the archives.
>

I may be one of Tom's vague "voices". ;)  The only issue would be that
you need to remove all your UNIQUE constraints before sending multiple
COPYs to the server.  This includes the PRIMARY KEY constraint.  To
the backend, COPY is just like INSERT: all constraints need to be
checked, and a violation will block the commit of one of the COPY streams.

However, multiple COPYs may not be needed.  I regularly load several
tables totaling around 50M rows with a single COPY per table.  I drop
(actually, this is during DB reload, so I don't yet create...) all
fkeys, constraints and indexes, and the data loads in a matter of 5
minutes or so.
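That drop-load-rebuild pattern looks roughly like this (all object names and the file path here are hypothetical):

```sql
-- Drop constraints and indexes, bulk load, then rebuild.
ALTER TABLE big_table DROP CONSTRAINT big_table_pkey;
DROP INDEX big_table_val_idx;

COPY big_table FROM '/var/lib/pgsql/load/chunk.dat';

ALTER TABLE big_table ADD PRIMARY KEY (id);
CREATE INDEX big_table_val_idx ON big_table (val);
```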

Hope that helps!

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: COPY vs INSERT

From
"David Roussel"
Date:
> COPY invokes all the same logic as INSERT on the server side
> (rowexclusive locking, transaction log, updating indexes, rules).
> The difference is that all the rows are inserted as a single
> transaction. This reduces the number of fsync's on the xlog,
> which may be a limiting factor for you. You'll want to crank
> WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though.
> One of my streams has 6K records; I run with WB=1000, CS=128.

So what's the difference between a COPY and a batch of INSERT
statements.  Also, surely, fsyncs only occur at the end of a
transaction, no need to fsync before a commit has been issued, right?

David

Re: COPY vs INSERT

From
Mischa Sandberg
Date:
Quoting David Roussel <pgsql-performance@diroussel.xsmail.com>:

> > COPY invokes all the same logic as INSERT on the server side
> > (rowexclusive locking, transaction log, updating indexes, rules).
> > The difference is that all the rows are inserted as a single
> > transaction. This reduces the number of fsync's on the xlog,
> > which may be a limiting factor for you. You'll want to crank
> > WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though.
> > One of my streams has 6K records; I run with WB=1000, CS=128.
>
> So what's the difference between a COPY and a batch of INSERT
> statements.  Also, surely, fsyncs only occur at the end of a
> transaction, no need to fsync before a commit has been issued,
> right?

Sorry, I was comparing granularities the other way around. As far as
xlog is concerned, a COPY is ALWAYS one big txn, no matter how many
putline commands you use to feed the copy. With inserts, you can choose
whether to commit every row, every nth row, etc.
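The choice of commit granularity with INSERTs looks roughly like this (sqlite3 is used here purely to make the sketch self-contained and runnable; with PostgreSQL the same pattern goes through your driver):

```python
import sqlite3

def load_with_batched_commits(rows, batch_size=1000):
    """Insert rows, committing every batch_size rows rather than per-row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO t VALUES (?, ?)", row)
        pending += 1
        if pending == batch_size:
            conn.commit()   # one flush per batch instead of per row
            pending = 0
    conn.commit()           # flush the final partial batch
    return conn

conn = load_with_batched_commits([(i, "x") for i in range(2500)], batch_size=1000)
```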

Copy makes better use of the TCP connection for transmission. COPY uses
the TCP connection like a one-way pipe. INSERT is like an RPC: the
sender has to wait until the insert's return status roundtrips.
--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.


Re: COPY vs INSERT

From
John A Meinel
Date:
David Roussel wrote:
>>COPY invokes all the same logic as INSERT on the server side
>>(rowexclusive locking, transaction log, updating indexes, rules).
>>The difference is that all the rows are inserted as a single
>>transaction. This reduces the number of fsync's on the xlog,
>>which may be a limiting factor for you. You'll want to crank
>>WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though.
>>One of my streams has 6K records; I run with WB=1000, CS=128.
>
>
> So what's the difference between a COPY and a batch of INSERT
> statements.  Also, surely, fsyncs only occur at the end of a
> transaction, no need to fsync before a commit has been issued, right?

I think COPY also has the advantage that for index updates it only grabs
the lock once, rather than grabbing and releasing for each row. But I
believe you are right that fsync only happens on COMMIT.

>
> David

John
=:->


Re: COPY vs INSERT

From
Christopher Petrilli
Date:
On 5/4/05, Mischa Sandberg <mischa.sandberg@telus.net> wrote:
> Quoting David Roussel <pgsql-performance@diroussel.xsmail.com>:
>
> > > COPY invokes all the same logic as INSERT on the server side
> > > (rowexclusive locking, transaction log, updating indexes, rules).
> > > The difference is that all the rows are inserted as a single
> > > transaction. This reduces the number of fsync's on the xlog,
> > > which may be a limiting factor for you. You'll want to crank
> > > WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though.
> > > One of my streams has 6K records; I run with WB=1000, CS=128.
> >
> > So what's the difference between a COPY and a batch of INSERT
> > statements.  Also, surely, fsyncs only occur at the end of a
> > transaction, no need to fsync before a commit has been issued,
> > right?
>
> Sorry, I was comparing granularities the other way around. As far as
> xlog is concerned, a COPY is ALWAYS one big txn, no matter how many
> putline commands you use to feed the copy. With inserts, you can choose
> whether to commit every row, every nth row, etc.
>
> Copy makes better use of the TCP connection for transmission. COPY uses
> the TCP connection like a one-way pipe. INSERT is like an RPC: the
> sender has to wait until the insert's return status roundtrips.

I have found even greater performance increases by using COPY FROM
<filename> not COPY FROM STDIN.  This allows the backend process to
directly read the file, rather than shoving it over a pipe (thereby
potentially hitting the CPU multiple times).  My experience is that
this is anywhere from 5-10x faster than INSERT statements on the
whole, and sometimes 200x.
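For reference, the server-side form is just (table and path are hypothetical; note the path is resolved on the server's filesystem, not the client's):

```sql
-- The backend process reads the file directly, with no round trip
-- through the client connection.
COPY events FROM '/var/lib/pgsql/load/events.dat';
```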

Chris

--
| Christopher Petrilli
| petrilli@gmail.com

Re: COPY vs INSERT

From
Kris Jurka
Date:

On Wed, 4 May 2005, Mischa Sandberg wrote:

> Copy makes better use of the TCP connection for transmission. COPY uses
> the TCP connection like a one-way pipe. INSERT is like an RPC: the
> sender has to wait until the insert's return status roundtrips.

Not true.  A client may send any number of Bind/Execute messages on a
prepared statement before a Sync message.  So multiple inserts may be sent
in one network roundtrip.  This is exactly how the JDBC driver
implements batch statements.  There is some limit to the number of queries
in flight at any given moment because there is the potential to deadlock
if both sides of network buffers are filled up and each side is blocked
waiting on a write.  The JDBC driver has conservatively selected 256 as
the maximum number of queries to send at once.

Kris Jurka

Re: COPY vs INSERT

From
Christopher Kings-Lynne
Date:
> So what's the difference between a COPY and a batch of INSERT
> statements.  Also, surely, fsyncs only occur at the end of a
> transaction, no need to fsync before a commit has been issued, right?

With COPY, the data being inserted itself does not have to pass through
the postgresql parser.
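A quick illustration: in COPY's text format the rows arrive as plain tab-separated data rather than as SQL statements that must each be parsed (table and values are made up):

```sql
COPY mytable (id, name) FROM STDIN;
1	alice
2	bob
\.
```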

Chris

Re: COPY vs INSERT

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> So what's the difference between a COPY and a batch of INSERT
>> statements.  Also, surely, fsyncs only occur at the end of a
>> transaction, no need to fsync before a commit has been issued, right?

> With COPY, the data being inserted itself does not have to pass through
> the postgresql parser.

Also, there is a whole lot of one-time-per-statement overhead that can
be amortized across many rows instead of only one.  Stuff like opening
the target table, looking up the per-column I/O conversion functions,
identifying trigger functions if any, yadda yadda.  It's not *that*
expensive, but compared to an operation as small as inserting a single
row, it's significant.

            regards, tom lane

Re: COPY vs INSERT

From
Mischa Sandberg
Date:
Quoting Kris Jurka <books@ejurka.com>:

> On Wed, 4 May 2005, Mischa Sandberg wrote:
>
> > Copy makes better use of the TCP connection for transmission. COPY
> uses
> > the TCP connection like a one-way pipe. INSERT is like an RPC: the
> > sender has to wait until the insert's return status roundtrips.
>
> Not true.  A client may send any number of Bind/Execute messages on a
> prepared statement before a Sync message.  So multiple inserts may be
> sent in one network roundtrip.  This is exactly how the JDBC driver
> implements batch statements.  There is some limit to the number of
> queries in flight at any given moment because there is the potential
> to deadlock if both sides of network buffers are filled up and each
> side is blocked waiting on a write.  The JDBC driver has
> conservatively selected 256 as the maximum number of queries to send
> at once.

Hunh. Interesting optimization in the JDBC driver. I gather it is
sending a string of (;)-separated inserts. Sounds like
efficient-but-risky stuff we did for ODBC drivers at Simba ... gets
interesting when one of the insert statements in the middle fails.
Good to know. Hope that the batch size is parametric, given that you
can have inserts with rather large strings bound to 'text' columns in
PG --- harder to identify BLOBs when talking to PG, than when talking
to MSSQL/Oracle/Sybase.

--
Engineers think equations approximate reality.
Physicists think reality approximates the equations.
Mathematicians never make the connection.


Re: COPY vs INSERT

From
Kris Jurka
Date:

On Wed, 4 May 2005, Mischa Sandberg wrote:

> Quoting Kris Jurka <books@ejurka.com>:
>
> > Not true.  A client may send any number of Bind/Execute messages on
> > a prepared statement before a Sync message.

> Hunh. Interesting optimization in the JDBC driver. I gather it is
> sending a string of (;)-separated inserts.

No, it uses the V3 protocol and a prepared statement and uses
Bind/Execute, as I mentioned.

> Sounds like efficient-but-risky stuff we did for ODBC drivers at Simba
> ... gets interesting when one of the insert statements in the middle
> fails.

When running inside a transaction (as you really want to do anyway when
bulk loading) it is well defined; it is a little odd for autocommit mode
though.  In autocommit mode the transaction boundary is at the Sync
message, not the individual Execute messages, so you will get some
rollback on error.  The JDBC spec is poorly defined in this area, so we
can get away with this.

> Good to know. Hope that the batch size is parametric, given that
> you can have inserts with rather large strings bound to 'text' columns
> in PG --- harder to identify BLOBs when talking to PG, than when talking
> to MSSQL/Oracle/Sybase.

The batch size is not a parameter and I don't think it needs to be.  The
issue of filling both sides of network buffers and deadlocking only needs
to be avoided on one side.  The response to an insert request is small and
not dependent on the size of the data sent, so we can send as much as we
want as long as the server doesn't send much back to us.

Kris Jurka

Re: COPY vs INSERT

From
Keith Worthington
Date:
Christopher Petrilli wrote:
> On 5/4/05, Mischa Sandberg <mischa.sandberg@telus.net> wrote:
>
>>Quoting David Roussel <pgsql-performance@diroussel.xsmail.com>:
>>
>>
>>>>COPY invokes all the same logic as INSERT on the server side
>>>>(rowexclusive locking, transaction log, updating indexes, rules).
>>>>The difference is that all the rows are inserted as a single
>>>>transaction. This reduces the number of fsync's on the xlog,
>>>>which may be a limiting factor for you. You'll want to crank
>>>>WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though.
>>>>One of my streams has 6K records; I run with WB=1000, CS=128.
>>>
>>>So what's the difference between a COPY and a batch of INSERT
>>>statements.  Also, surely, fsyncs only occur at the end of a
>>>transaction, no need to fsync before a commit has been issued,
>>>right?
>>
>>Sorry, I was comparing granularities the other way around. As far as
>>xlog is concerned, a COPY is ALWAYS one big txn, no matter how many
>>putline commands you use to feed the copy. With inserts, you can choose
>>whether to commit every row, every nth row, etc.
>>
>>Copy makes better use of the TCP connection for transmission. COPY uses
>>the TCP connection like a one-way pipe. INSERT is like an RPC: the
>>sender has to wait until the insert's return status roundtrips.
>
>
> I have found even greater performance increases by using COPY FROM
> <filename> not COPY FROM STDIN.  This allows the backend process to
> directly read the file, rather than shoving it over a pipe (thereby
> potentially hitting the CPU multiple times).  My experience is that
> this is anywhere from 5-10x faster than INSERT statements on the
> whole, and sometimes 200x.
>
> Chris
>

Unfortunately, COPY FROM '<file>' can only be done by a superuser.  If
you have that option then that is great.  If not...

--
Kind Regards,
Keith

Re: COPY vs INSERT

From
"Jim C. Nasby"
Date:
On Wed, May 04, 2005 at 10:22:56PM -0400, Tom Lane wrote:
> Also, there is a whole lot of one-time-per-statement overhead that can
> be amortized across many rows instead of only one.  Stuff like opening
> the target table, looking up the per-column I/O conversion functions,
> identifying trigger functions if any, yadda yadda.  It's not *that*
> expensive, but compared to an operation as small as inserting a single
> row, it's significant.

Has thought been given to supporting inserting multiple rows in a single
insert? DB2 supported:

INSERT INTO table VALUES(
    (1,2,3),
    (4,5,6),
    (7,8,9)
);

I'm not sure how standard that is or if other databases support it.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: COPY vs INSERT

From
Dennis Bjorklund
Date:
On Fri, 6 May 2005, Jim C. Nasby wrote:

> Has thought been given to supporting inserting multiple rows in a single
> insert? DB2 supported:
>
> INSERT INTO table VALUES(
>     (1,2,3),
>     (4,5,6),
>     (7,8,9)
> );
>
> I'm not sure how standard that is or if other databases support it.

The SQL standard includes this, except that you cannot have the outer ().
So it should be

INSERT INTO table VALUES
    (1,2,3),
    (4,5,6),
    (7,8,9);

Does DB2 demand these extra ()?

--
/Dennis Björklund


Re: COPY vs INSERT

From
Harald Fuchs
Date:
In article <Pine.LNX.4.44.0505060927520.7072-100000@zigo.dhs.org>,
Dennis Bjorklund <db@zigo.dhs.org> writes:

> On Fri, 6 May 2005, Jim C. Nasby wrote:
>> Has thought been given to supporting inserting multiple rows in a single
>> insert? DB2 supported:
>>
>> INSERT INTO table VALUES(
>> (1,2,3),
>> (4,5,6),
>> (7,8,9)
>> );
>>
>> I'm not sure how standard that is or if other databases support it.

> The SQL standard includes this, except that you cannot have the outer ().
> So it should be

> INSERT INTO table VALUES
>     (1,2,3),
>     (4,5,6),
>     (7,8,9);

Since MySQL has been supporting this idiom for ages, it can't be
standard ;-)

Re: COPY vs INSERT

From
Bruno Wolff III
Date:
On Fri, May 06, 2005 at 01:51:29 -0500,
  "Jim C. Nasby" <decibel@decibel.org> wrote:
> On Wed, May 04, 2005 at 10:22:56PM -0400, Tom Lane wrote:
> > Also, there is a whole lot of one-time-per-statement overhead that can
> > be amortized across many rows instead of only one.  Stuff like opening
> > the target table, looking up the per-column I/O conversion functions,
> > identifying trigger functions if any, yadda yadda.  It's not *that*
> > expensive, but compared to an operation as small as inserting a single
> > row, it's significant.
>
> Has thought been given to supporting inserting multiple rows in a single
> insert? DB2 supported:
>
> INSERT INTO table VALUES(
>     (1,2,3),
>     (4,5,6),
>     (7,8,9)
> );
>
> I'm not sure how standard that is or if other databases support it.

It's on the TODO list. I don't remember anyone bringing this up for about
a year now, so I doubt anyone is actively working on it.

Re: COPY vs INSERT

From
Tom Lane
Date:
Bruno Wolff III <bruno@wolff.to> writes:
>   "Jim C. Nasby" <decibel@decibel.org> wrote:
>> Has thought been given to supporting inserting multiple rows in a single
>> insert?

> It's on the TODO list. I don't remember anyone bringing this up for about
> a year now, so I doubt anyone is actively working on it.

It is on TODO but I think it is only there for standards compliance.
It won't produce near as much of a speedup as using COPY does ---
in particular, trying to put thousands of rows through at once with
such a command would probably be a horrible idea.  You'd still have
to pay the price of lexing/parsing, and there would also be considerable
flailing about with deducing the data type of the VALUES() construct.
(Per spec that can be used in SELECT FROM, not only in INSERT, and so
it's not clear to what extent we can use knowledge of the insert target
columns to avoid running the generic union-type-resolution algorithm for
each column of the VALUES() :-(.)  Add on the price of shoving an
enormous expression tree through the planner and executor, and it starts
to sound pretty grim.

            regards, tom lane

Re: COPY vs INSERT

From
"Jim C. Nasby"
Date:
On Fri, May 06, 2005 at 09:30:46AM +0200, Dennis Bjorklund wrote:
> The SQL standard includes this, except that you cannot have the outer ().
> So it should be
>
> INSERT INTO table VALUES
>     (1,2,3),
>     (4,5,6),
>     (7,8,9);
>
> Do DB2 demand these extra ()?

My recollection is that it does, but it's been a few years...
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"