Обсуждение: Inexplicable duplicate rows with unique constraint
I'm trying to track down the cause of some duplicate rows in a table which I would expect to be impossible due to a unique constraint. I'm hoping that somebody here will be able to suggest something I might have missed. The problem relates to a bug filed against our application (https://github.com/matrix-org/synapse/issues/6696). At first I put this down to random data corruption on a single user's postgres instance, but I've now seen three separate reports in as many days and am wondering if there is more to it. We have a table whose schema is as follows: synapse=# \d current_state_events Table "public.current_state_events" Column | Type | Modifiers ------------+------+----------- event_id | text | not null room_id | text | not null type | text | not null state_key | text | not null membership | text | Indexes: "current_state_events_event_id_key" UNIQUE CONSTRAINT, btree (event_id) "current_state_events_room_id_type_state_key_key" UNIQUE CONSTRAINT, btree (room_id, type, state_key) "current_state_events_member_index" btree (state_key) WHERE type = 'm.room.member'::text Despite the presence of the current_state_events_room_id_type_state_key_key constraint, several users have reported seeing errors which suggest that their tables have duplicate rows for the same (room_id, type, state_key) triplet and indeed querying confirms that to be the case: synapse=> select count(*), room_id, type, state_key from current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2; count | room_id | type | state_key -------+-----------------------------------+---------------+------------------------------------- 3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member | @irc_ebi_:darkfasel.net 3 | !HwocBmCtBcHQhILtYQ:matrix.org | m.room.member | @freenode_AlmarShenwan_:matrix.org (2 rows) Further investigation suggests that these are genuinely separate rows rather than duplicate entries in an index. The index appears to consider itself valid: synapse=> select i.* from pg_class c join pg_index i on i.indexrelid=c.oid where relname='current_state_events_room_id_type_state_key_key'; indexrelid | indrelid | indnatts | indisunique | indisprimary | indisexclusion | indimmediate | indisclustered | indisvalid | indcheckxmin | indisready | indislive | indisreplident | indkey | indcollation | indclass | indoption | indexprs | indpred ------------+----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+-----------+----------------+--------+--------------+----------------+-----------+----------+--------- 17023 | 16456 | 3 | t | f | f | t | f | t | f | t | t | f | 2 3 4 | 100 100 100 | 3126 3126 3126 | 0 0 0 | | (1 row) So, question: what could we be doing wrong to get ourselves into this situation? Some other datapoints which may be relevant: * this has been reported by one user on postgres 9.6.15 and one on 10.10, though it's hard to be certain of the version that was running when the duplication occurred * the constraint is added when the table is first created (before any data is added) * At least one user reports that he has recently migrated his database from one server to another via a `pg_dump -C` and later piping into psql.
On 1/16/20 8:50 AM, Richard van der Hoff wrote: > I'm trying to track down the cause of some duplicate rows in a table > which I would expect to be impossible due to a unique constraint. I'm > hoping that somebody here will be able to suggest something I might have > missed. > > The problem relates to a bug filed against our application > (https://github.com/matrix-org/synapse/issues/6696). At first I put this > down to random data corruption on a single user's postgres instance, but > I've now seen three separate reports in as many days and am wondering if > there is more to it. > > We have a table whose schema is as follows: > > synapse=# \d current_state_events > Table "public.current_state_events" > Column | Type | Modifiers > ------------+------+----------- > event_id | text | not null > room_id | text | not null > type | text | not null > state_key | text | not null > membership | text | > Indexes: > "current_state_events_event_id_key" UNIQUE CONSTRAINT, btree > (event_id) > "current_state_events_room_id_type_state_key_key" UNIQUE > CONSTRAINT, btree (room_id, type, state_key) > "current_state_events_member_index" btree (state_key) WHERE type = > 'm.room.member'::text > > Despite the presence of the > current_state_events_room_id_type_state_key_key constraint, several > users have reported seeing errors which suggest that their tables have > duplicate rows for the same (room_id, type, state_key) triplet and > indeed querying confirms that to be the case: > > synapse=> select count(*), room_id, type, state_key from > current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2; > count | room_id | type | state_key > -------+-----------------------------------+---------------+------------------------------------- > > 3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member | > @irc_ebi_:darkfasel.net > 3 | !HwocBmCtBcHQhILtYQ:matrix.org | m.room.member | > @freenode_AlmarShenwan_:matrix.org > (2 rows) > I'm assuming the above are obfuscated? > > Further investigation suggests that these are genuinely separate rows > rather than duplicate entries in an index. If you use length() on the values are they the same? > > The index appears to consider itself valid: > > synapse=> select i.* from pg_class c join pg_index i on > i.indexrelid=c.oid where > relname='current_state_events_room_id_type_state_key_key'; > indexrelid | indrelid | indnatts | indisunique | indisprimary | > indisexclusion | indimmediate | indisclustered | indisvalid | > indcheckxmin | indisready | indislive | indisreplident | indkey | > indcollation | indclass | indoption | indexprs | indpred > ------------+----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+-----------+----------------+--------+--------------+----------------+-----------+----------+--------- > > 17023 | 16456 | 3 | t | f | f > | t | f | t | f | t > | t | f | 2 3 4 | 100 100 100 | 3126 > 3126 3126 | 0 0 0 | | > (1 row) > > So, question: what could we be doing wrong to get ourselves into this > situation? > > Some other datapoints which may be relevant: > * this has been reported by one user on postgres 9.6.15 and one on > 10.10, though it's hard to be certain of the version that was running > when the duplication occurred > * the constraint is added when the table is first created (before any > data is added) > * At least one user reports that he has recently migrated his database > from one server to another via a `pg_dump -C` and later piping into psql. > > -- Adrian Klaver adrian.klaver@aklaver.com
Richard van der Hoff <richard@matrix.org> writes:
> I'm trying to track down the cause of some duplicate rows in a table 
> which I would expect to be impossible due to a unique constraint. I'm 
> hoping that somebody here will be able to suggest something I might have 
> missed.
Since these are text columns, one possibility you should be looking into
is that the indexes have become corrupt due to a change in the operating
system's sorting rules for the underlying locale.  I don't recall details
at the moment, but I do remember that a recent glibc update changed the
sorting rules for some popular locale settings.  If an installation had
applied such an update underneath an existing database, you'd have a
situation where existing entries in an index are not in-order according
to the new behavior of the text comparison operators, leading to havoc
because btree searching relies on the entries being correctly sorted.
Unless you happen to notice searches failing to find rows you know are
there, the first visible symptom is often appearance of "impossible"
duplicate rows, after the search to verify uniqueness of a new entry
fails to find the old entry.
>   * At least one user reports that he has recently migrated his database 
> from one server to another via a `pg_dump -C` and later piping into psql.
Dump-and-restore wouldn't cause this (and, indeed, is one way to clean up
the mess).  But this is suspicious anyway because it suggests there may
have been some general system upgrades going on in the vicinity.
Reindexing all text indexes is the recommended remediation procedure
if you suspect a locale behavior change.  There's some work afoot to
make PG notice the need for this automatically, but it's not done yet.
            regards, tom lane
			
		Richard van der Hoff wrote: > So, question: what could we be doing wrong to get ourselves into this > situation? OS/libc upgrades without reindexing come to mind. See https://wiki.postgresql.org/wiki/Collations > * At least one user reports that he has recently migrated his database > from one server to another via a `pg_dump -C` and later piping into psql. This kind of migration recreates indexes (since it recreates everything) so it's not subject to the same hazard as an OS/libc upgrade without reindexing. In fact it would detect the problem if it existed, as the creation of the unique constraint in the new db would fail if the data in the dump did not satisfy it. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Richard van der Hoff <richard@matrix.org> writes: > > I'm trying to track down the cause of some duplicate rows in a table > > which I would expect to be impossible due to a unique constraint. I'm > > hoping that somebody here will be able to suggest something I might have > > missed. > > Since these are text columns, one possibility you should be looking into > is that the indexes have become corrupt due to a change in the operating > system's sorting rules for the underlying locale. I don't recall details > at the moment, but I do remember that a recent glibc update changed the > sorting rules for some popular locale settings. If an installation had > applied such an update underneath an existing database, you'd have a > situation where existing entries in an index are not in-order according > to the new behavior of the text comparison operators, leading to havoc > because btree searching relies on the entries being correctly sorted. See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on which linux distros updated when. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
On 16/01/2020 17:12, Magnus Hagander wrote: > On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Richard van der Hoff <richard@matrix.org> writes: >>> I'm trying to track down the cause of some duplicate rows in a table >>> which I would expect to be impossible due to a unique constraint. I'm >>> hoping that somebody here will be able to suggest something I might have >>> missed. >> >> Since these are text columns, one possibility you should be looking into >> is that the indexes have become corrupt due to a change in the operating >> system's sorting rules for the underlying locale. I don't recall details >> at the moment, but I do remember that a recent glibc update changed the >> sorting rules for some popular locale settings. If an installation had >> applied such an update underneath an existing database, you'd have a >> situation where existing entries in an index are not in-order according >> to the new behavior of the text comparison operators, leading to havoc >> because btree searching relies on the entries being correctly sorted. > > See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on > which linux distros updated when. > Right, thanks to all who have suggested this. It seems like a plausible explanation but it's worth noting that all the indexed data here is (despite being in text columns), plain ascii. I'm surprised that a change in collation rules would change the sorting of such strings, and hence that it could lead to this problem. Am I naive? To answer Adrian's question: the lengths of the values in the indexed columns are identical between the duplicated rows.
Richard van der Hoff wrote: > synapse=> select count(*), room_id, type, state_key from > current_state_events group by 2,3,4 order by count(*) DESC LIMIT 2; > count | room_id | type | state_key > -------+-----------------------------------+---------------+------------------------------------- > 3 | !ueLfVrSWYGMnFnoCbT:darkfasel.net | m.room.member | > @irc_ebi_:darkfasel.net > 3 | !HwocBmCtBcHQhILtYQ:matrix.org | m.room.member | > @freenode_AlmarShenwan_:matrix.org Looking at these columns which are of type text but do not contain words of any particular language, there's probably no point in using a linguistic-aware collation for them. If you maintain the database schema, what you could do to avoid the dependency on the OS collation and stay clear of the particular upgrade difficulty of collations is to use COLLATE "C" for this kind of field, as opposed to the default collation of the database. As a bonus, operations with the "C" collations tend to be faster, sometimes even much faster. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On 1/16/20 9:24 AM, Richard van der Hoff wrote: > > On 16/01/2020 17:12, Magnus Hagander wrote: >> On Thu, Jan 16, 2020 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>> Richard van der Hoff <richard@matrix.org> writes: >>>> I'm trying to track down the cause of some duplicate rows in a table >>>> which I would expect to be impossible due to a unique constraint. I'm >>>> hoping that somebody here will be able to suggest something I might >>>> have >>>> missed. >>> >>> Since these are text columns, one possibility you should be looking into >>> is that the indexes have become corrupt due to a change in the operating >>> system's sorting rules for the underlying locale. I don't recall >>> details >>> at the moment, but I do remember that a recent glibc update changed the >>> sorting rules for some popular locale settings. If an installation had >>> applied such an update underneath an existing database, you'd have a >>> situation where existing entries in an index are not in-order according >>> to the new behavior of the text comparison operators, leading to havoc >>> because btree searching relies on the entries being correctly sorted. >> >> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on >> which linux distros updated when. >> > Right, thanks to all who have suggested this. > > It seems like a plausible explanation but it's worth noting that all the > indexed data here is (despite being in text columns), plain ascii. I'm > surprised that a change in collation rules would change the sorting of > such strings, and hence that it could lead to this problem. Am I naive? In psql who does: \l the_database_name show? > > To answer Adrian's question: the lengths of the values in the indexed > columns are identical between the duplicated rows. > > -- Adrian Klaver adrian.klaver@aklaver.com
On 16/01/2020 17:24, Daniel Verite wrote: > Looking at these columns which are of type text but do not > contain words of any particular language, there's probably > no point in using a linguistic-aware collation for them. > > If you maintain the database schema, what you could do to avoid > the dependency on the OS collation and stay clear of the particular > upgrade difficulty of collations is to use COLLATE "C" for this kind of > field, as opposed to the default collation of the database. > As a bonus, operations with the "C" collations tend to be faster, > sometimes even much faster. Good to know, thanks Daniel!
On 16/01/2020 17:27, Adrian Klaver wrote:
> On 1/16/20 9:24 AM, Richard van der Hoff wrote:
>>
>> It seems like a plausible explanation but it's worth noting that all 
>> the indexed data here is (despite being in text columns), plain ascii. 
>> I'm surprised that a change in collation rules would change the 
>> sorting of such strings, and hence that it could lead to this problem. 
>> Am I naive?
> 
> In psql who does:
> 
> \l the_database_name
> 
> show?
synapse=> \l synapse
                                  List of databases
   Name   |  Owner   | Encoding |   Collate   |    Ctype    |   Access 
privileges
---------+----------+----------+-------------+-------------+-----------------------
  synapse | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
postgres=CTc/postgres+
          |          |          |             |             | 
=Tc/postgres         +
          |          |          |             |             | 
synapse=CTc/postgres
(1 row)
			
		Richard van der Hoff <richard@matrix.org> writes: > On 16/01/2020 17:12, Magnus Hagander wrote: >> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on >> which linux distros updated when. > It seems like a plausible explanation but it's worth noting that all the > indexed data here is (despite being in text columns), plain ascii. I'm > surprised that a change in collation rules would change the sorting of > such strings, and hence that it could lead to this problem. Am I naive? Unfortunately, strings containing punctuation do sort differently after these changes, even with all-ASCII data. The example given on that wiki page demonstrates this. RHEL6 (old glibc): $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort 11 1-1 Fedora 30 (new glibc): $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort 1-1 11 I concur with Daniel's suggestion that maybe "C" locale is the thing to use for this data. regards, tom lane
On 16/01/2020 17:48, Tom Lane wrote: > Richard van der Hoff <richard@matrix.org> writes: >> On 16/01/2020 17:12, Magnus Hagander wrote: >>> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on >>> which linux distros updated when. > >> It seems like a plausible explanation but it's worth noting that all the >> indexed data here is (despite being in text columns), plain ascii. I'm >> surprised that a change in collation rules would change the sorting of >> such strings, and hence that it could lead to this problem. Am I naive? > > Unfortunately, strings containing punctuation do sort differently > after these changes, even with all-ASCII data. The example given > on that wiki page demonstrates this. Thank you Tom. I should learn to read properly. I think this very much explains the symptoms we're seeing. Thanks very much for the detailed and very helpful answers!
That's why I created a virtual_string function to squeeze out everything 
but alpha characters and numbers 0-9 from any varchar or text columns 
that I want to use as business key columns.  For example, if I have a 
column named job_name, I will have a companion column named v_job_name.  
The v_ column is to replicate Oracle's virtual column, since postgres 
doesn't have it.  You don't put any values in the v_ column directly. I 
simply have a trigger on insert or update to put the value in the 
v_job_name column using the virtual_string(new.job_name) function.  It's 
the v_job_name column that use in my unique constraint so that I avoid 
any unexpected sorting.  Meanwhile, my job_name column is still human 
readable with whatever characters I want to see, including diacritics.
Here is my function, if you want to try it out:
create or replace function store.virtual_string(string_in text)
returns text as
$body$
    declare
        l_return text;
    begin
        l_return := regexp_replace 
(lower(unaccent(string_in)),'[^0-9a-z]','','g');
        return l_return;
    end;
$body$
   language plpgsql volatile security definer
;
Sue
---
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Susan E Hurst
Principal Consultant
Brookhurst Data LLC
Email: susan.hurst@brookhurstdata.com
Mobile: 314-486-3261
On 2020-01-16 11:48, Tom Lane wrote:
> Richard van der Hoff <richard@matrix.org> writes:
>> On 16/01/2020 17:12, Magnus Hagander wrote:
>>> See https://wiki.postgresql.org/wiki/Locale_data_changes for hints on
>>> which linux distros updated when.
> 
>> It seems like a plausible explanation but it's worth noting that all 
>> the
>> indexed data here is (despite being in text columns), plain ascii. I'm
>> surprised that a change in collation rules would change the sorting of
>> such strings, and hence that it could lead to this problem. Am I 
>> naive?
> 
> Unfortunately, strings containing punctuation do sort differently
> after these changes, even with all-ASCII data.  The example given
> on that wiki page demonstrates this.
> 
> RHEL6 (old glibc):
> 
> $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort
> 11
> 1-1
> 
> Fedora 30 (new glibc):
> 
> $ ( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.utf8 sort
> 1-1
> 11
> 
> I concur with Daniel's suggestion that maybe "C" locale is
> the thing to use for this data.
> 
>             regards, tom lane
			
		On Thu, 2020-01-16 at 16:50 +0000, Richard van der Hoff wrote: > I'm trying to track down the cause of some duplicate rows in a table > which I would expect to be impossible due to a unique constraint. I'm > hoping that somebody here will be able to suggest something I might have > missed. > > The problem relates to a bug filed against our application > (https://github.com/matrix-org/synapse/issues/6696). At first I put this > down to random data corruption on a single user's postgres instance, but > I've now seen three separate reports in as many days and am wondering if > there is more to it. [...] > So, question: what could we be doing wrong to get ourselves into this > situation? > > Some other datapoints which may be relevant: > * this has been reported by one user on postgres 9.6.15 and one on > 10.10, though it's hard to be certain of the version that was running > when the duplication occurred > * the constraint is added when the table is first created (before any > data is added) > * At least one user reports that he has recently migrated his database > from one server to another via a `pg_dump -C` and later piping into psql. I see no hint that this may be the problem, but I have seen corruption like this because of changes in the collations of the C library (which PostgreSQL uses). This only happens with collations other than C, and it cannot be caused by dump/restore. It may, however, be caused by the following: - Upgrading the operating system where PostgreSQL is running to a different glibc. - Streaming replication between machines with different glibc version, and failing over to the standby. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com