Re: autocommit (true/false) for more than 1 million records

From: David Johnston
Subject: Re: autocommit (true/false) for more than 1 million records
Date:
Msg-id: CAKFQuwb5=x9dzf9_Um=-quL=+p_O-rUnHOc50FwaoDetqDq7=A@mail.gmail.com
In response to: Re: autocommit (true/false) for more than 1 million records  (Alex Goncharov <alex.goncharov.usa@gmail.com>)
List: pgsql-performance



On Wed, Aug 27, 2014 at 1:02 AM, Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:
> Thank you, Kevin -- this is helpful.

Thank you David, too.


> But it still leaves questions for me.

Still...


Alex Goncharov <alex.goncharov.usa@gmail.com> wrote:

>>> How do I decide, before starting a COPY data load, whether such a load
>>> protection ("complexity") makes sense ("is necessary")?

This is *the* practical question.


David G Johnston <david.g.johnston@gmail.com> wrote:

> You should probably consider something like: http://pgloader.io/

This is not my question; I want to see if anybody can offer a
meaningful strategy for evaluating whether or not to use COPY for
loading the "big data".

OK.  Though I presume that, given the limitations of COPY -- of which the "all-or-nothing" behavior is one -- pointing out more user-friendly APIs would be worthwhile.


If nobody can, fine: it'll give me the reason to claim "Nobody knows".

> Normal case, with normal COPY,

This is the case I am asking about: the limitations of the COPY
operation for "big data" -- up to what point a plain COPY can be used.

> you load a bad file into an empty table, it fails, you truncate and
> get better data for the next attempt.

This is not how many businesses operate.


Yet this is basically what you are asking about...

 
> How long that will take is system (IOPS/CPU) and data dependent.

"How long", was not the question: my question was originally about the
behavior for a bad record at the end of a large data set submitted to
COPY; when it was stated that the data "in process" becomes an
invisible (until committed) part of the target table, it became
obvious to me that the fundamental question has to be asked: "How much
can fit there, in the temporary operational space (whatever it's
called in PostgreSQL.)?"  "df /mount -> free" or "2^32"?

> The probability of failure is source dependent - and prior
> experience plays a large role here as well.

Not the question.

> If you plan to load directly into a live table the wasted space from
> a bad load could kill you so smaller partial loads are better - if
> you can afford the implicit system inconsistency such a partial load
> would cause.

Not the question.

These were things to consider when deciding whether it is worthwhile to split the large file into chunks.


> If you understand how the system works

I don't, to the necessary extent, so I asked for an expert opinion :)

> you should be able to evaluate the different pieces and come to a
> conclusion as how best to proceed in a specific situation.  No one
> else on this list has the relevant information to make that
> judgement call.

We'll see; too early to tell yet :)

> If this is just asking about rules-of-thumb

Yes.

> I'd say figure out how many records 100MB consumes and COMMIT after that
> many records.

Pardon me: I am running COPY -- and I know how many records have been
processed so far?.. (Hmm... I can't.)

Take your 1TB file, extract the first 100MB, count the number of record separators.  Commit after that many.
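A minimal sketch of that estimate, assuming newline-terminated records in COPY's text format and GNU tools (DATA-FILE is the same placeholder name used later in this thread):

   # count the records in the first 100MB of the feed; commit after roughly that many
   head -c 100M DATA-FILE | wc -l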

> 10,000 records is also a nice round number to pick - regardless of
> the amount of MB consumed.  Start there and tweak based upon
> experience.

You are clearly suggesting to split the large data file into many
small ones -- and to split it intelligently, on record boundaries.

And since this is very hard and would require separate, external
processing machinery, I am trying to understand up to what point it is
safe not to do this (and subject to what factors).


See thoughts to consider from previous e-mail.
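For what it's worth, splitting on record boundaries need not involve external machinery if the feed is in COPY's default text format (one record per line, no embedded newlines); a sketch, where the chunk size, database, and table names are made up:

   # split DATA-FILE into ~10,000-record pieces on line (record) boundaries
   split -l 10000 DATA-FILE chunk_
   # load each piece in its own transaction, stopping at the first failure
   for f in chunk_*; do
       psql -d mydb -c "\copy big_table from '$f'" || break
   done

(CSV input with quoted embedded newlines would need a smarter splitter.)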


> If you are not taking advantage of the "unlogged load" optimization,

I don't see any way to control this for COPY only.  Are you talking
about the 'postgresql.conf' settings?

I am not sure if this is the same thing, but I am pretty sure he is referring to creating an unlogged table as the COPY target -- thus avoiding WAL.
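If that is indeed what was meant, a sketch of the approach (the database, table names, and file path are hypothetical; note that an unlogged table skips WAL for the load but is truncated after a crash):

   # stage the load in an unlogged table modeled on the real target
   psql -d mydb -c "CREATE UNLOGGED TABLE big_table_stage (LIKE big_table)"
   psql -d mydb -c "\copy big_table_stage from 'DATA-FILE'"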


> If you only have 500k free in your archive directory that 1MB file
> will pose a problem...though if you have 4TB of archive available
> the 1TB would fit easily.

So the answer to the "How much data can fit in the COPY storage
areas?" question is solely a "df /mount/point" thing?

I.e. before initiating the COPY, I should:

   ls -l DATA-FILE
   df -m /server/db-cluster/pg_data-or-something

compare the two values and be assured that my COPY will reach the end
of my DATA-FILE (whether it stumbles at the end or not) if the former
value is meaningfully smaller than the latter?

I would take this for the answer. (Let's see if there are other
evaluation suggestions.)

That should get the copy to succeed, though whether you blow up your archives or slaves would not be addressed.

  
> Do you compress your WAL files before shipping them off to the
> archive?  How compressible is your data?

Try to give me the upper limit evaluation strategy, when all the
compression and archive factors are working in my favor.

Assume worst-case unless you know, from experimentation, what an appropriate compression factor would be -- keeping in mind that I presume you expect other simultaneous activity on the same server.  If you then fall into a marginal situation, you can decide whether reducing your estimates to hit your goal is worth the risk, though you can incorporate that into your overall planned buffer as well.


> I'm sure people have decent rules-of-thumb here

I would love to hear about them.

> but in the end your specific environment and data, especially at the
> TB scale, is going to be important; and is something that you will
> only discover through testing.

"Don't malloc 2 GB on a system with 100 MB RAM" is a meaningful rule
of thumb, not requiring any testing.  I am looking for similar simple
guiding principles for COPY.

>> > > Say, I am COPYing 100 TB of data and the bad records are close
>> > > to the end of the feed -- how will this all error out?
>> >
>> > The rows will all be in the table, but not visible to any other
>> > transaction.
>>
>> I see.  How much data can I fit there while doing COPY?  Not 1 TB?

> You need the same amount of space that you would require if the file
> imported to completion.

> PostgreSQL is optimistic in this regard - it assumes you will commit
> and so up until failure there is no difference between a good and
> bad import.

I can see it now, thanks.

> I'm not sure what a reasonable formula would be, especially at the TB
> scale,

Make it 1 GB then :)

Can I load 1 GB (uncompressed) via one COPY?

You cannot load compressed data via COPY...
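Nothing stops you, though, from decompressing on the fly and streaming the result into COPY; a rough sketch, assuming a gzip'ed feed in the default text format (the names are made up):

   # stream the decompressed feed into COPY without staging it on disk first
   zcat DATA-FILE.gz | psql -d mydb -c "COPY big_table FROM STDIN"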

While I have never worked at such a scale myself, my conclusion thus far is that, with enough hard drive space and (at least depending on the FK situation noted) enough RAM, you should be able to load a file of any size with a single COPY without getting system errors and/or crashing the server (postgres or OS).

In the simple case, the decision to split depends on the probability of a data error and how much data (and time) you are willing to lose should one occur.


And when can I not -- when "df" says that there is less than 10 GB of
free disk space in the relevant file systems?  Would that be all I
need to know?

> but roughly 2x the size of the imported (uncompressed) file would be
> a good starting point (table + WAL).  You likely would want many
> multiples of this unless you are dealing with a one-off event.
> Indexes and dead tuples in particular are likely to be involved.
> You get some leeway depending on compression but that is data
> specific and thus something you have to test yourself if you are
> operating at the margin of your system's resources.

I am willing to accept any factor -- 2x, 10x.  I want to be certain,
though, what the factor compares: what over what.  So far, only the
"df free space" versus "data file size" consideration has come up.


RAM came up as well, and that was not enumerated -- other than it being optional in the case where no FKs are defined.

I'm not sure what kind of overhead there is on WAL and data pages, but there is going to be some.  At scale, hopefully compression would wash out the overhead, so figuring 2x for any stored data seems like a reasonable disk-free-space requirement for the most basic scenario.  Indexes count as data -- you'd likely want to consider at least one that operates as the primary key.
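As a back-of-the-envelope check along those lines -- the 2x factor is just the rule of thumb above, and the file and mount-point names are the placeholders already used in this thread:

   # require roughly 2x the uncompressed feed size to be free under the data directory
   need_kb=$(( 2 * $(du -k DATA-FILE | cut -f1) ))
   free_kb=$(df -Pk /server/db-cluster/pg_data-or-something | awk 'NR==2 {print $4}')
   if [ "$free_kb" -gt "$need_kb" ]; then echo "probably enough room"; else echo "too tight"; fi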

Kevin is a better source for this than I - mostly I'm drawing conclusions from what I read in his post.

David J.
