Re: Populating large tables with occasional bad values

From: John T. Dow
Subject: Re: Populating large tables with occasional bad values
Date:
Msg-id: 200806132215.m5DMFRrO015727@web2.nidhog.com
In reply to: Populating large tables with occasional bad values ("John T. Dow" <john@johntdow.com>)
List: pgsql-jdbc
I have a solution that I am happy with, yielding nearly a 10-fold improvement in speed.

I tried a quick-and-dirty experiment with different batch sizes on a small table of 750 rows.

Without using batches, the insert took 64 seconds. (Auto commit true for this and the other tests.)

Batch size 10, 13 seconds.

Batch size 100, 5 seconds.

This code was very simple -- if there was any error in the batch, the entire batch was discarded. This gives an idea of what performance would be like.

Of course, with a batch size of 10 and two errors, the number of rows loaded is short by 17 because two entire batches were discarded. With a batch size of 100, it's short by 150. (I also neglected to send the last, partial batch, this being just a quick-and-dirty experiment.)

Anyway, a batch size of 10 yields roughly a fivefold speedup; a batch size of 100, roughly tenfold.
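
For the record, the batching loop is nothing special. Roughly like the following minimal sketch -- the table, column names, and row source are placeholders, not a real schema:

    import java.sql.*;

    public class BatchLoadSketch {
        // Quick-and-dirty batched insert with autocommit on. On any error,
        // the whole pending batch is simply discarded.
        static void load(Connection conn, String[][] rows) throws SQLException {
            final int BATCH_SIZE = 100;
            PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO target_table (qty, descr) VALUES (?, ?)");
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    try {
                        ps.executeBatch();
                    } catch (BatchUpdateException e) {
                        // discard the entire batch
                    }
                    pending = 0;
                }
            }
            if (pending > 0) {
                try {
                    ps.executeBatch();   // the last, partial batch
                } catch (BatchUpdateException e) {
                    // discard
                }
            }
            ps.close();
        }
    }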

Craig wrote:

>IMO the best solution with batches is to use batches to store your
>planned operations in a staging table using INSERTs that cannot fail
>under normal circumstances. Then issue a single PL/PgSQL call or a bulk
>UPDATE ... SELECT with an appropriate WHERE clause to apply the updates
>and catch problems.
>
>The approach you describe will work, but it'll be a bit ugly and not
>very fast. If failures are very rare you might find it to be OK, but
>your suggestion of using only 10 statements per batch suggests that
>failures aren't *that* rare.
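
In JDBC terms, I take that suggestion to mean something like the following -- a hypothetical sketch, with invented table and column names, and the integer-check regex standing in for whatever validation the real columns need:

    import java.sql.*;

    public class StagingApplySketch {
        // Hypothetical sketch of the staging-table approach. Raw values go
        // into an all-varchar staging table (those INSERTs cannot fail on
        // bad values); one statement then moves the rows that cast cleanly.
        static void apply(Connection conn) throws SQLException {
            Statement st = conn.createStatement();
            st.executeUpdate(
                "CREATE TEMP TABLE staging (qty varchar, descr varchar)");
            // ... batch-insert the raw values into staging here ...
            st.executeUpdate(
                "INSERT INTO target_table (qty, descr) " +
                "SELECT qty::integer, descr FROM staging " +
                "WHERE qty ~ '^-?[0-9]+$'");  // keep only rows whose qty parses
            // Rows failing the WHERE test stay behind in staging and can be
            // reported as errors.
            st.close();
        }
    }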

I need reasonable performance, not the best that money can buy. My clients have tables with tens of thousands of rows, not hundreds of thousands or millions. Also, I want a general-purpose solution that will work with no special knowledge of the table to be loaded -- just the information obtainable from the metadata (such as column names and types). A staging table such as you suggest could be created automatically (copy the columns but change them all to character varying so they load). But then locating rows with bad values would be a messy bit of SQL, although it could be generated by my code. Still, I'd have to bring over the good rows (casting character varying to integer or whatever) and then remove them, leaving the bad rows behind.
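
Generating that all-varchar staging table from the metadata would be straightforward -- something like this sketch, where the _staging suffix is just an invented convention:

    import java.sql.*;

    public class StagingDdlSketch {
        // Sketch: build CREATE TABLE DDL for a staging table with the same
        // column names as the target but every column character varying,
        // using only what DatabaseMetaData reports.
        static String stagingDdl(Connection conn, String table) throws SQLException {
            DatabaseMetaData md = conn.getMetaData();
            StringBuilder ddl = new StringBuilder(
                "CREATE TABLE " + table + "_staging (");
            ResultSet cols = md.getColumns(null, null, table, "%");
            boolean first = true;
            while (cols.next()) {
                if (!first) ddl.append(", ");
                ddl.append(cols.getString("COLUMN_NAME")).append(" varchar");
                first = false;
            }
            cols.close();
            return ddl.append(")").toString();
        }
    }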

I'll clean up my dirty code, e.g. make sure it sends the last partial batch. I'll also have it iterate to insert the rows in any batch that's rejected. The simplest thing to do is simply resend the rows in the batch, one by one. A more elaborate approach would be to resend a partial batch up to the point of the problem, skip the known problem row, then send another partial batch with the remaining rows, repeating until every row in the original batch has either been successfully inserted or written to the error log.
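
The simple variant would look roughly like this -- a sketch that assumes, as observed above, that a failed batch inserts nothing, so every row in it is safe to resend:

    import java.sql.*;
    import java.util.List;

    public class RetrySketch {
        // Sketch of the simple fallback: if executeBatch fails, resend the
        // rows of that batch one at a time, logging the ones that still
        // fail. The two-column row shape is a placeholder.
        static void sendBatch(PreparedStatement ps, List<String[]> batchRows)
                throws SQLException {
            try {
                ps.executeBatch();
            } catch (BatchUpdateException e) {
                for (String[] row : batchRows) {
                    try {
                        ps.setString(1, row[0]);
                        ps.setString(2, row[1]);
                        ps.executeUpdate();
                    } catch (SQLException rowErr) {
                        // this is one of the genuinely bad rows
                        System.err.println("bad row: " + row[0] + "," + row[1]
                            + " -- " + rowErr.getMessage());
                    }
                }
            }
        }
    }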

John

