Discussion: Best strategy for bulk inserts where some violate unique constraint?


Best strategy for bulk inserts where some violate unique constraint?

From:
Denis Papathanasiou
Date:
I'm using Psycopg2 to insert multiple rows at once, using a dictionary as shown at the bottom of this page: https://wiki.postgresql.org/wiki/Psycopg2_Tutorial

I notice that even if just one entry violates a table unique constraint (an IntegrityError in psycopg2 parlance), the entire contents of the dictionary do not get inserted into the table.

Is there any way to tell postgres: it's ok to reject individual entries that may violate table unique constraints, but at the same time allow those which do not?

It's too expensive for me to do the inserts one at a time where I could just catch the psycopg2.IntegrityError.

Re: Best strategy for bulk inserts where some violate unique constraint?

From:
Daniele Varrazzo
Date:
On Tue, Nov 5, 2013 at 8:14 PM, Denis Papathanasiou
<denis.papathanasiou@gmail.com> wrote:
> I'm using Psycopg2 to insert multiple rows at once, using a dictionary as
> shown at the bottom of this page:
> https://wiki.postgresql.org/wiki/Psycopg2_Tutorial
>
> I notice that even if just one entry violates a table unique constraint (an
> IntegrityError in psycopg2 parlance), the entire contents of the dictionary
> do not get inserted into the table.
>
> Is there any way to tell postgres: it's ok to reject individual entries that
> may violate table unique constraints, but at the same time allow those which
> do not?
>
> It's too expensive for me to do the inserts one at a time where I could just
> catch the psycopg2.IntegrityError.

As for the last example, do you mean the executemany() example? That is not
going to be much faster than repeated execute().

The easiest thing you can do is to switch to autocommit=True and do
repeated execute() with INSERT. If one fails, you can just ignore the
IntegrityError and go on.

About as easy: you can prepare a statement and execute it repeatedly
using PREPARE/EXECUTE; see
<http://www.postgresql.org/docs/9.2/static/sql-prepare.html>. There is
no built-in support for that in psycopg, but you can just execute()
these statements yourself. It may save you something. You can also take
a look at this example of a PREPAREing cursor:
<https://gist.github.com/dvarrazzo/3797445>.

However, the fastest way to insert data into Postgres is COPY; see
<http://initd.org/psycopg/docs/cursor.html#cursor.copy_from>. You will
have to present your data as a file. I can't remember what happens
when a record fails the integrity check: I think the others would still
be inserted, but you will have to verify. A much more robust strategy is
to create a temporary table with the right schema but without
constraints, load the data there using COPY, and then move the data to
the final table using INSERT INTO ... SELECT * FROM temp_table WHERE
..., specifying a condition to exclude the records that would violate
the constraint.

I would go for COPY: it's by far faster than execute[many], even
including PREPARE.
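The temp-table strategy could be sketched as below (assuming the unique constraint covers first_name/last_name; the table names are hypothetical and the sketch is untested against a live database):

```python
import io

def rows_to_copy_buffer(rows):
    """Render dicts as the tab-separated text that copy_from() expects."""
    buf = io.StringIO()
    for r in rows:
        buf.write("%s\t%s\n" % (r["first_name"], r["last_name"]))
    buf.seek(0)
    return buf

def copy_without_duplicates(conn, rows):
    with conn.cursor() as cur:
        # Staging table with the same columns as the target, but no
        # unique constraint (LIKE does not copy constraints by default).
        cur.execute("CREATE TEMP TABLE bar_stage (LIKE bar) ON COMMIT DROP")
        cur.copy_from(rows_to_copy_buffer(rows), "bar_stage",
                      columns=("first_name", "last_name"))
        # Move only the rows that will not violate the unique constraint.
        cur.execute("""
            INSERT INTO bar (first_name, last_name)
            SELECT DISTINCT s.first_name, s.last_name
            FROM bar_stage s
            WHERE NOT EXISTS (
                SELECT 1 FROM bar t
                WHERE t.first_name = s.first_name
                  AND t.last_name = s.last_name)
            """)
    conn.commit()
```

The DISTINCT also weeds out duplicates within the incoming batch itself, which the per-row approaches handle implicitly.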

-- Daniele


Re: Best strategy for bulk inserts where some violate unique constraint?

From:
Adrian Klaver
Date:
On 11/05/2013 03:31 PM, Daniele Varrazzo wrote:

>
> As for the last example, do you mean the executemany() example? That is not
> going to be much faster than repeated execute().
>
> The easiest thing you can do is to switch to autocommit=True and do
> repeated execute() with INSERT. If one fails, you can just ignore the
> IntegrityError and go on.
>
> About as easy: you can prepare a statement and execute it repeatedly
> using PREPARE/EXECUTE; see
> <http://www.postgresql.org/docs/9.2/static/sql-prepare.html>. There is
> no built-in support for that in psycopg, but you can just execute()
> these statements yourself. It may save you something. You can also take
> a look at this example of a PREPAREing cursor:
> <https://gist.github.com/dvarrazzo/3797445>.
>
> However, the fastest way to insert data into Postgres is COPY; see
> <http://initd.org/psycopg/docs/cursor.html#cursor.copy_from>. You will
> have to present your data as a file. I can't remember what happens
> when a record fails the integrity check: I think the others would still
> be inserted, but you will have to verify.

It will fail.

http://www.postgresql.org/docs/9.3/interactive/sql-copy.html

COPY stops operation at the first error. This should not lead to
problems in the event of a COPY TO, but the target table will already
have received earlier rows in a COPY FROM. These rows will not be
visible or accessible, but they still occupy disk space. This might
amount to a considerable amount of wasted disk space if the failure
happened well into a large copy operation. You might wish to invoke
VACUUM to recover the wasted space.

I would suggest checking out pgloader; it is designed to deal with this
scenario.


> A much more robust strategy is
> to create a temporary table with the right schema but without
> constraints, load the data there using COPY, and then move the data to
> the final table using INSERT INTO ... SELECT * FROM temp_table WHERE
> ..., specifying a condition to exclude the records that would violate
> the constraint.
>
> I would go for COPY: it's by far faster than execute[many], even
> including PREPARE.
>
> -- Daniele
>
>


--
Adrian Klaver
adrian.klaver@gmail.com


Re: Best strategy for bulk inserts where some violate unique constraint?

From:
Denis Papathanasiou
Date:
On Tue, Nov 5, 2013 at 6:31 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
As for the last example, do you mean the executemany() example?

Yes, I was referring to this code example:

> If you had the following:
> namedict = ({"first_name":"Joshua", "last_name":"Drake"},
>             {"first_name":"Steven", "last_name":"Foo"},
>             {"first_name":"David", "last_name":"Bar"})
> You could easily insert all three rows within the dictionary by using:
> cur = conn.cursor()
> cur.executemany("""INSERT INTO bar(first_name,last_name) VALUES (%(first_name)s, %(last_name)s)""", namedict)
> The cur.executemany statement will automatically iterate through the dictionary and execute the INSERT query for each row.

What I wound up doing in my own code is this:

    # process each {columns=values} dict
    try:
        cur.executemany("INSERT INTO ...", insert_dict)
    except psycopg2.IntegrityError:
        conn.commit()

I was hoping that by catching the unique constraint error and committing, it would at least preserve the inserts from the dict that do not violate the table constraints, but that is not how postgres works.
 
That is not
going to be much faster than repeated execute().

Really?
 
The easiest thing you can do is to switch to autocommit=True and do
repeated execute() with INSERT. If one fails, you can just ignore the
IntegrityError and go on.

Ok, I'll try that, thanks.  

About as easy: you can prepare a statement and execute it repeatedly
using PREPARE/EXECUTE; see
<http://www.postgresql.org/docs/9.2/static/sql-prepare.html>. There is
no built-in support for that in psycopg, but you can just execute()
these statements yourself. It may save you something. You can also take
a look at this example of a PREPAREing cursor:
<https://gist.github.com/dvarrazzo/3797445>.

Thanks for the example.
 
However, the fastest way to insert data into Postgres is COPY; see
<http://initd.org/psycopg/docs/cursor.html#cursor.copy_from>. You will
have to present your data as a file. I can't remember what happens
when a record fails the integrity check: I think the others would still
be inserted, but you will have to verify. A much more robust strategy is
to create a temporary table with the right schema but without
constraints, load the data there using COPY, and then move the data to
the final table using INSERT INTO ... SELECT * FROM temp_table WHERE
..., specifying a condition to exclude the records that would violate
the constraint.

I would go for COPY: it's by far faster than execute[many], even
including PREPARE.

I saw Adrian's reply about how this still won't do what I need in terms of ignoring the constraint violators and keeping the valid entries; will switching autocommit to True first have the desired effect? 

Re: Best strategy for bulk inserts where some violate unique constraint?

From:
Denis Papathanasiou
Date:
On Tue, Nov 5, 2013 at 6:38 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
[snip]
I would suggest checking out pgloader; it is designed to deal with this scenario.

Thanks for that suggestion.

Is this where the most current repo lives: http://pgfoundry.org/projects/pgloader/

Re: Best strategy for bulk inserts where some violate unique constraint?

From:
Adrian Klaver
Date:
On 11/06/2013 06:50 AM, Denis Papathanasiou wrote:
> On Tue, Nov 5, 2013 at 6:31 PM, Daniele Varrazzo

>
>
> I saw Adrian's reply about how this still won't do what I need in terms
> of ignoring the constraint violators and keeping the valid entries; will
> switching autocommit to True first have the desired effect?

No, a particular COPY is treated as a single transaction.


--
Adrian Klaver
adrian.klaver@gmail.com


Re: Best strategy for bulk inserts where some violate unique constraint?

From:
Daniele Varrazzo
Date:
On Wed, Nov 6, 2013 at 2:54 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
> On 11/06/2013 06:50 AM, Denis Papathanasiou wrote:
>>
>> On Tue, Nov 5, 2013 at 6:31 PM, Daniele Varrazzo
>>
>> I saw Adrian's reply about how this still won't do what I need in terms
>> of ignoring the constraint violators and keeping the valid entries; will
>> switching autocommit to True first have the desired effect?
>
> No, a particular COPY is treated as a single transaction.

If I'm not mistaken, copying into a temporary table and then inserting
only the wanted records into the target table should work as expected
and leave no bloat. Staying within the psycopg realm, it should be the
most efficient way to load data; outside the psycopg world there could
be more efficient stand-alone solutions.

-- Daniele


Re: Best strategy for bulk inserts where some violate unique constraint?

From:
Adrian Klaver
Date:
On 11/06/2013 07:02 AM, Daniele Varrazzo wrote:
> On Wed, Nov 6, 2013 at 2:54 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
>> On 11/06/2013 06:50 AM, Denis Papathanasiou wrote:
>>>
>>> On Tue, Nov 5, 2013 at 6:31 PM, Daniele Varrazzo
>>>
>>> I saw Adrian's reply about how this still won't do what I need in terms
>>> of ignoring the constraint violators and keeping the valid entries; will
>>> switching autocommit to True first have the desired effect?
>>
>> No, a particular COPY is treated as a single transaction.
>
> If I'm not mistaken, copying into a temporary table and then inserting
> only the wanted records into the target table should work as expected
> and leave no bloat. Staying within the psycopg realm, it should be the
> most efficient way to load data; outside the psycopg world there could
> be more efficient stand-alone solutions.

Agreed, and that is what I often do; it just adds a step.

>
> -- Daniele
>


--
Adrian Klaver
adrian.klaver@gmail.com