Thread: Issues with upserts
The wiki entry https://wiki.postgresql.org/wiki/UPSERT kind of suggests that with the introduction of ON CONFLICT DO UPDATE the problem of upserts is solved. But is it?
A common use case for upserts is to keep a table up to date from an external data source. So you might have a cron job that runs a relatively large batch of upserts every couple of minutes.
I have found that this use case is not adequately covered by ON CONFLICT DO UPDATE for two reasons:
- New versions are created for all rows, even if the data is identical. This quickly fills up the WAL and puts unnecessary load on the tablespace drives.
- If the conflict target is not the serial column, the sequence backing the serial column gets incremented for every row. This quickly reaches the point where the serial for new rows exceeds the range of an integer.
Does this mean I have to SELECT the data first and do the conflict check in the application or is there a better SQL-only way?
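(For concreteness, a minimal sketch of the pattern described above, with hypothetical table and column names: every conflicting row is rewritten even when the incoming data is identical, and nextval() on the serial's sequence is consumed for every row in the batch.)

```sql
-- Naive batch upsert from a staging table (illustrative names):
INSERT INTO items (ext_id, payload)
SELECT ext_id, payload FROM staging_items
ON CONFLICT (ext_id) DO UPDATE SET payload = EXCLUDED.payload;
```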
Jeremy Smith wrote:
It sounds like you aren't adding a WHERE clause to prevent the duplicate rows from being updated. It would help if you could share your query, but in general it could look something like this:
INSERT INTO my_table (col1, col2)
SELECT col1, col2 FROM other_table
ON CONFLICT (col1, col2) DO UPDATE SET (col1, col2) = (EXCLUDED.col1, EXCLUDED.col2)
WHERE (my_table.col1, my_table.col2) IS DISTINCT FROM (EXCLUDED.col1, EXCLUDED.col2);
Here’s an example:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=b48d062d2eedbab14157359694b16081
CREATE TABLE t (
id serial PRIMARY KEY,
name text NOT NULL UNIQUE,
address text NOT NULL
);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
SELECT last_value FROM t_id_seq;
This will yield “8”, showing that new sequence numbers have been generated for each attempt.
André Hänsel <andre@webkr.de> writes:
> This will yield “8”, showing that new sequence numbers have been generated for each attempt.

Well, yeah, because the INSERT has to be attempted first, and that includes forming the whole candidate row including the nextval() result. If you're expecting a serial ID column to not have holes in the sequence of values, you're going to be sadly disappointed, whether you use ON CONFLICT UPDATE or not.

			regards, tom lane
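(One SQL-only workaround that follows from this explanation: filter rows that are already identical out of the source query, so the INSERT, and therefore nextval(), is never attempted for them. A sketch using the table from the earlier example; the VALUES list is illustrative. Note the NOT EXISTS check reads a snapshot, so it is not race-free against concurrent writers, which is usually acceptable for a single cron job.)

```sql
INSERT INTO t (name, address)
SELECT v.name, v.address
FROM (VALUES ('foo', 'Baker street'), ('bar', 'Miller street')) AS v(name, address)
WHERE NOT EXISTS (
    SELECT 1 FROM t
    WHERE t.name = v.name AND t.address = v.address  -- row already up to date: skip
)
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address;
```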
On 7/13/22 07:58, André Hänsel wrote:
> Jeremy Smith wrote:
> CREATE TABLE t (
>     id serial PRIMARY KEY,
>     name text NOT NULL UNIQUE,
>     address text NOT NULL
> );
>
> This will yield “8”, showing that new sequence numbers have been
> generated for each attempt.

If running out of id's is a concern use bigserial instead of serial as it uses bigint:

bigint    -9223372036854775808 to +9223372036854775807

vs integer for serial:

integer   -2147483648 to +2147483647

-- 
Adrian Klaver
adrian.klaver@aklaver.com
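(For an existing table, switching the serial column to bigint could look like the sketch below. The sequence name assumes the default serial naming convention; ALTER SEQUENCE ... AS requires PostgreSQL 10 or later, and the ALTER TABLE rewrites the table.)

```sql
ALTER TABLE t ALTER COLUMN id TYPE bigint;
ALTER SEQUENCE t_id_seq AS bigint;
```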