Re: Practical error logging for very large COPY statements

From: Christopher Kings-Lynne
Subject: Re: Practical error logging for very large COPY statements
Date:
Msg-id: 43827BC9.7050105@familyhealth.com.au
In reply to: Practical error logging for very large COPY statements  (Simon Riggs <simon@2ndquadrant.com>)
Responses: Re: Practical error logging for very large COPY  (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
Seems similar to the pgloader project on pgfoundry.org.

Chris

Simon Riggs wrote:
> If you've ever loaded 100 million rows, you'll know just how annoying it
> is to find that you have a duplicate row somewhere in there. Experience
> shows that there is always one, whatever oath the analyst swears
> beforehand.
> 
> It's hard to find out which row is the duplicate, plus you've just
> screwed up a big table. It needs a VACUUM, then a reload.
> 
> I'd like to find a way to handle this manual task programmatically. 
> 
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.
> 
> What I propose is to log uniqueness violations only when there is only a
> single unique index on a table.
> 
> Flow of control would be to:
> 
>     locate page of index where value should go
>     lock index block
>     _bt_check_unique, but don't error
>         if violation then insert row into ERRORTABLE
>     else
>         insert row into data block
>         insert row into unique index
>         unlock index block
>         do other indexes
> 
> Which is very similar code to the recently proposed MERGE logic.
> 
> With that logic, a COPY will run to completion, yet be able to report
> the odd couple of unique index violations it found along the way. More
> importantly, we can then handle those rows with another program to locate
> where those errors came from and resolve them.
> 
> 
> 
> In most cases with a single unique index, the index inserts are
> rightmost index entries anyway, so there is scope here for an additional
> optimisation: keep both index and data blocks locked across multiple row
> inserts until either the unique index or the data block fills. That's
> better than taking a full table lock, since it allows concurrent access
> to the rest of the table, but it's also more efficient than continually
> re-requesting the same blocks (which looks like about 10-15% saving on
> performance from hash lookups, lock/unlock, etc).
> 
> Best Regards, Simon Riggs
> 
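To make the idea concrete, here is a minimal sketch of what the proposed
ERRORTABLE clause might look like at the SQL level. The exact syntax, the
table names, and the use of a separate error table with the same row shape
as the target are assumptions for illustration only; none of this is
implemented:

    -- Target table with a single unique index (the primary key).
    CREATE TABLE big_table (
        id      bigint PRIMARY KEY,
        payload text
    );

    -- Hypothetical error table: same columns as the target, so a
    -- rejected row can be inserted here instead of aborting the load.
    CREATE TABLE big_table_errors (LIKE big_table);

    -- Hypothetical clause: rows violating the unique index go to the
    -- error table and the COPY runs to completion.
    COPY big_table FROM '/data/big_table.dat' ERRORTABLE big_table_errors;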



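Once such a COPY had finished, the rejected rows could be examined and
resolved with ordinary SQL, which is the "handle those rows with another
program" step. A rough sketch against the hypothetical tables above:

    -- How many rows were rejected during the load?
    SELECT count(*) FROM big_table_errors;

    -- Pair each rejected row with the row already loaded under the
    -- same key, to decide which version to keep.
    SELECT e.id,
           e.payload AS rejected_payload,
           t.payload AS loaded_payload
    FROM   big_table_errors e
    JOIN   big_table t USING (id);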