Re: COPY issue(gsoc project)

From longlong
Subject Re: COPY issue(gsoc project)
Date
Msg-id d9f0a46b0803131945x691c9626ma673628340b2e012@mail.gmail.com
In reply to COPY issue(gsoc project)  (longlong <asfnuts@gmail.com>)
Responses Re: COPY issue(gsoc project)  (NikhilS <nikkhils@gmail.com>)
List pgsql-hackers


2008/3/12, Neil Conway <neilc@samurai.com>:
I don't see why creating index entries in bulk has anything to do with
COPY vs. INSERT: if a lot of rows are being loaded into the table in a
single command, it would be a win to create the index entries in bulk,
regardless of whether COPY or INSERT ... SELECT is being used.

In any case, the "create indexes in bulk" hasn't actually been
implemented in mainline Postgres...

I mentioned pgloader just as an example of an existing implementation of
the "error recovery in COPY" idea. The issues with doing an
implementation of error recovery in the backend that I see are:

    * in order to be sure that you can recover from an error, you
      need to abort the current subtransaction

    * starting and committing a subtransaction for every row of the COPY
      would be too expensive

    * therefore, start and commit a subtransaction for every "n" rows
      of input. If an error occurs, you lose at most "n-1" rows of
      valid input, which you need to backup and reinsert. There are
      various approaches to choosing "n" (statically, based on the
      error rate of previous batches in the same load, etc.).


-Neil
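
To make that control flow concrete, here is a small standalone C sketch of the batch-of-"n" scheme described above. It is a simulation only, not backend code: bad[] stands in for rows whose index insert or trigger would fail, the m/n/p values are invented, and an aborted batch is simply retried row by row, each row in its own subtransaction.

/*
 * Standalone simulation of the "subtransaction per n rows" scheme.
 * Nothing here is backend code: bad[] marks rows whose index insert
 * or trigger would fail, and subtransactions are only counted.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long m = 1000000;     /* rows in the input file (assumed) */
    const long n = 100;         /* rows per subtransaction (assumed) */
    const double p = 0.001;     /* per-row error probability (assumed) */
    char *bad = malloc(m);
    long subxacts = 0, rejected = 0;

    srand(42);
    for (long i = 0; i < m; i++)
        bad[i] = ((double) rand() / RAND_MAX) < p;

    for (long start = 0; start < m; start += n)
    {
        long end = (start + n < m) ? start + n : m;
        int batch_failed = 0;

        subxacts++;             /* BEGIN the batch subtransaction */
        for (long i = start; i < end; i++)
            if (bad[i]) { batch_failed = 1; break; }   /* error aborts the batch */

        if (batch_failed)
        {
            /* back up and re-insert the aborted batch row by row,
             * each row in its own subtransaction, so good rows survive */
            for (long i = start; i < end; i++)
            {
                subxacts++;
                if (bad[i])
                    rejected++;
            }
        }
        /* else the batch subtransaction commits normally */
    }

    printf("batch size %ld: %ld subtransactions, %ld rows rejected\n",
           n, subxacts, rejected);
    free(bad);
    return 0;
}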


I think this is a better idea, from NikhilS:
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00584.php
But instead of using a per insert or a batch insert subtransaction, I am thinking that we can start off a subtransaction and continue it till we encounter a failure. The moment an error is encountered, since we have the offending (already in heap) tuple around, we can call a simple_heap_delete on the same and commit (instead of aborting) this subtransaction after doing some minor cleanup. This current input data row can also be logged into a bad file. Recall that we need to only handle those errors in which the simple_heap_insert is successful, but the index insertion or the after row insert trigger causes an error. The rest of the load then can go ahead with the start of a new subtransaction.
The simplest things are often the best. I wonder whether it is hard to implement, or has some other deficiency, since you prefer a subtransaction for every "n" rows instead.
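
For comparison, here is the same kind of standalone simulation for the scheme quoted above: keep one subtransaction open until a row fails, then drop the offending tuple, log the row, commit that subtransaction and start a new one. Again bad[], m and p are invented stand-ins, and the real heap/index calls (simple_heap_insert, simple_heap_delete) are reduced to counters.

/*
 * Standalone simulation of the "commit the subtransaction at each error"
 * idea.  bad[] marks rows whose index insert or after-row trigger would
 * fail; the heap manipulation itself is not modelled.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long m = 1000000;     /* rows in the input file (assumed) */
    const double p = 0.001;     /* per-row error probability (assumed) */
    char *bad = malloc(m);
    long subxacts = 1;          /* the first subtransaction is opened up front */
    long rejected = 0;

    srand(42);
    for (long i = 0; i < m; i++)
        bad[i] = ((double) rand() / RAND_MAX) < p;

    for (long i = 0; i < m; i++)
    {
        if (bad[i])
        {
            /* the heap insert succeeded but the index/trigger failed:
             * delete the offending tuple, log the row to the bad file,
             * commit this subtransaction and start the next one */
            rejected++;
            subxacts++;
        }
        /* good rows simply accumulate in the current subtransaction */
    }

    printf("%ld subtransactions, %ld rows logged to the bad file\n",
           subxacts, rejected);
    free(bad);
    return 0;
}

With these made-up numbers it needs roughly p*m + 1 subtransactions (about a thousand), while the batch-of-"n" loop sketched earlier needs on the order of a hundred thousand.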

You mentioned in the thread "Re: VLDB Features" that "n" can be adjusted according to when and where errors happen; this is like some of the mechanisms in TCP congestion control (a rough sketch of such a rule is below). But you cannot ignore the time wasted inside a subtransaction before it hits an error, especially when "n" is big.
I do not know the exact cost of a subtransaction (begin and commit) or of a COPY line read (CopyReadLine()), so I will just count the number of subtransactions:
f(n) = (1 - (1-p)^n) * m + m/n
Here m is the number of input lines, n is the number of lines per subtransaction, and p is the probability that a given row causes an error.
A big "n" reduces the number of batch subtransactions (m/n), but it also increases the chance that a batch hits an error. Unless p is extremely small, choosing a big "n" is a big mistake.
In practice errors tend to come in clusters (in my experience), so the situation may be a little better.
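
As a quick check of the formula, the following standalone snippet evaluates f(n) for a few batch sizes (the m and p values are invented):

/* Evaluate f(n) = (1 - (1-p)^n)*m + m/n, the estimated number of
 * subtransactions when every aborted batch of n rows is retried
 * row by row.  The m and p values are assumptions. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double m = 1e6;       /* rows to load (assumed) */
    const double p = 0.001;     /* per-row error probability (assumed) */
    const double sizes[] = {1, 10, 32, 100, 1000, 10000};

    for (int i = 0; i < (int) (sizeof(sizes) / sizeof(sizes[0])); i++)
    {
        double n = sizes[i];
        double f = (1.0 - pow(1.0 - p, n)) * m + m / n;
        printf("n = %5.0f  ->  f(n) ~ %8.0f subtransactions\n", n, f);
    }
    return 0;
}

For small p the minimum sits near n ~ 1/sqrt(p) (about 32 here), and f(n) climbs quickly for larger n, which is the "big n is a mistake" point above, in numbers.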

However, the idea (from NikhilS) that I started with is the perfect solution. Yes, I have seen the email archives of the thread "Re: VLDB Features", and I noticed some disagreement about the commit behaviour and so on; that will not be a problem, since so many similar problems have already been solved in PostgreSQL.
