Re: autocommit (true/false) for more than 1 million records

From: Alex Goncharov
Subject: Re: autocommit (true/false) for more than 1 million records
Date:
Msg-id: CAOnt2=MG4L3Prb=AQ3Etqs3kOZbPDw7Y-ap0MHn75GR0LfV2Fg@mail.gmail.com
In reply to: Re: autocommit (true/false) for more than 1 million records  (Kevin Grittner <kgrittn@ymail.com>)
Responses: Re: autocommit (true/false) for more than 1 million records  (David G Johnston <david.g.johnston@gmail.com>)
Re: autocommit (true/false) for more than 1 million records  (Albe Laurenz <laurenz.albe@wien.gv.at>)
Re: autocommit (true/false) for more than 1 million records  (Kevin Grittner <kgrittn@ymail.com>)
List: pgsql-performance
Thank you, Kevin -- this is helpful.

But it still leaves questions for me.

Kevin Grittner <kgrittn@ymail.com> wrote:

> Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:

> > The whole thing is aborted then, and the good 99 records are not
> > making it into the target table.
>
> Right.  This is one reason people often batch such copies or check
> the data very closely before copying in.

How do I decide, before starting a COPY data load, whether such a load
protection ("complexity") makes sense ("is necessary")?

Clearly not needed for 1 MB of data in a realistic environment.

Clearly needed for loading 1 TB in a realistic environment.

To put it differently: If I COPY 1 TB of data, what criteria should I
use for choosing the size of the chunks to split the data into?

For INSERT-based loading, with database client interfaces that offer an
array mode, the performance difference between loading 100 or 1000 rows
at a time is usually negligible, if there is any.  Therefore 100- and
1000-row array sizes are both reasonable choices.
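
(For reference, this is the kind of array-mode INSERT loading I mean --
a rough psycopg2 sketch; the table t(id, val), the connection string
and the data are all made up:)

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname=test")        # placeholder connection string
    cur = conn.cursor()

    rows = [(i, "payload-%d" % i) for i in range(100000)]   # stand-in data

    # Rows are sent as multi-row INSERTs; page_size is the number of rows
    # per statement, and 100 vs. 1000 here rarely changes the load time.
    execute_values(cur, "INSERT INTO t (id, val) VALUES %s", rows,
                   page_size=1000)
    conn.commit()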

But what is a reasonable size for a COPY chunk?  It can't even be
measured in rows.

Note that if you have a 1 TB record-formatted file to load, you can't
just split it into 1 MB chunks and feed them to COPY -- the file has to
be split on record boundaries.

So, splitting the data for COPY is not a trivial operation, and if
such splitting can be avoided, a reasonable operator will avoid it.
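
If I did have to split, the best I can come up with is reading the
input N lines at a time, so that every chunk ends on a record boundary,
and committing one COPY per chunk -- a rough psycopg2 sketch, assuming
a made-up table t and a plain newline-delimited text format with no
embedded newlines:

    import io
    import itertools
    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder connection string
    cur = conn.cursor()

    CHUNK_ROWS = 1000000                     # arbitrary chunk size, in records

    with open("data.txt") as f:              # made-up input file
        while True:
            # islice takes whole lines, so a chunk always ends on a record
            # boundary (assuming no embedded newlines in the data).
            chunk = list(itertools.islice(f, CHUNK_ROWS))
            if not chunk:
                break
            cur.copy_expert("COPY t FROM STDIN", io.StringIO("".join(chunk)))
            conn.commit()                    # each chunk is its own transaction

A failed chunk then costs only that chunk, not the whole load -- but
this only shifts my question to picking CHUNK_ROWS.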

But then again: when can it be avoided?

> > My question is: Where have these 99 records been living, on
> > the database server, while the 100-th one hasn't come yet, and
> > the need to throw the previous data accumulation away has not
> > come yet?
>
> They will have been written into the table.  They do not become
> visible to any other transaction until and unless the inserting
> transaction successfully commits.  These slides may help:
>
> http://momjian.us/main/writings/pgsql/mvcc.pdf

Yeah, I know about the MVCC model...  The question is about the huge
amount of storage that has to be taken up, without a commit, while the
load is not yet complete -- and about the size constraints in effect here.

> > There have to be some limits to the space and/or counts taken by
> > the new, uncommitted, data, while the COPY operation is still in
> > progress.  What are they?
>
> Primarily disk space for the table.

How can that be found?  Is "df /mount/point" the deciding factor? Or
some 2^32 or 2^64 number?
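
(If it really is just free space on the file system, is something as
crude as this the right level of checking?  A sketch with made-up
paths, ignoring indexes, WAL and archiving:)

    import os
    import shutil

    data_file = "/staging/data.txt"        # made-up path to the input file
    pg_mount = "/var/lib/pgsql/data"       # made-up mount point of the cluster

    need = os.path.getsize(data_file)      # raw input size; the table on disk
                                           # will be larger (tuple headers etc.)
    free = shutil.disk_usage(pg_mount).free

    if need * 2 > free:                    # arbitrary safety factor of 2
        raise SystemExit("not enough headroom for this load")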

> If you are not taking advantage of the "unlogged load" optimization,
> you will have written Write Ahead Log (WAL) records, too -- which
> (depending on your configuration) you may be archiving.  In that
> case, you may need to be concerned about the archive space required.

"... may need to be concerned ..." if what?  Loading 1 MB? 1 GB? 1 TB?

If I am always concerned, and check something before a COPY, what
should I be checking?  What are the "OK-to-proceed" criteria?

> If you have foreign keys defined for the table, you may get into
> trouble on the RAM used to track pending checks for those
> constraints.  I would recommend adding any FKs after you are done
> with the big bulk load.
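
Understood about the FKs -- I take it the sequence is roughly this
(a sketch; the constraint, table and column names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=test")   # placeholder connection string
    cur = conn.cursor()

    # Drop the FK for the duration of the bulk load.
    cur.execute("ALTER TABLE t DROP CONSTRAINT t_parent_fk")
    conn.commit()

    # ... run the COPY-based load here ...

    # Re-create the FK afterwards; all loaded rows get validated in one pass.
    cur.execute("ALTER TABLE t ADD CONSTRAINT t_parent_fk "
                "FOREIGN KEY (parent_id) REFERENCES parent (id)")
    conn.commit()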

I am curious about the simplest case where only the data storage is to
be worried about. (As an aside: the CHECK and NOT NULL constraints are
not a storage factor, right?)

> PostgreSQL does *not* have a "rollback log" which will impose a
> limit.

Something will though, right?  What would that be? The available disk
space on a file system? (I would be surprised.)

> > Say, I am COPYing 100 TB of data and the bad records are close
> > to the end of the feed -- how will this all error out?
>
> The rows will all be in the table, but not visible to any other
> transaction.

I see.  How much data can I fit there while doing COPY?  Not 1 TB?

-- Alex



On Tue, Aug 26, 2014 at 6:33 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:

> Suppose I COPY a huge amount of data, e.g. 100 records.
>
> My 99 records are fine for the target, and the 100-th is not --
> it comes with a wrong record format or a target constraint
> violation.
>
> The whole thing is aborted then, and the good 99 records are not
> making it into the target table.

Right.  This is one reason people often batch such copies or check
the data very closely before copying in.

> My question is: Where have these 99 records been living, on
> the database server, while the 100-th one hasn't come yet, and
> the need to throw the previous data accumulation away has not
> come yet?

They will have been written into the table.  They do not become
visible to any other transaction until and unless the inserting
transaction successfully commits.  These slides may help:

http://momjian.us/main/writings/pgsql/mvcc.pdf

> There have to be some limits to the space and/or counts taken by
> the new, uncommitted, data, while the COPY operation is still in
> progress.  What are they?

Primarily disk space for the table.  If you are not taking
advantage of the "unlogged load" optimization, you will have
written Write Ahead Log (WAL) records, too -- which (depending on
your configuration) you may be archiving.  In that case, you may
need to be concerned about the archive space required.  If you have
foreign keys defined for the table, you may get into trouble on the
RAM used to track pending checks for those constraints.  I would
recommend adding any FKs after you are done with the big bulk load.

PostgreSQL does *not* have a "rollback log" which will impose a limit.

> Say, I am COPYing 100 TB of data and the bad records are close
> to the end of the feed -- how will this all error out?

The rows will all be in the table, but not visible to any other
transaction.  Autovacuum will clean them out in the background, but
if you want to restart your load against an empty table it might be
a good idea to TRUNCATE that table; it will be a lot faster.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
