[HACKERS] unique index violation after pg_upgrade to PG10

Поиск
Список
Период
Сортировка
От Justin Pryzby
Тема [HACKERS] unique index violation after pg_upgrade to PG10
Дата
Msg-id 20171024181453.GI21735@telsasoft.com
обсуждение исходный текст
Ответы Re: [HACKERS] unique index violation after pg_upgrade to PG10  (Kenneth Marshall <ktm@rice.edu>)
Список pgsql-hackers
I upgrade another instance to PG10 yesterday and this AM found unique key
violations.

Our application is SELECTing FROM sites WHERE site_location=$1, and if it
doesn't find one, INSERTs one (I know that's racy and not ideal).  We ended up
with duplicate sites, despite a unique index.  We removed the duplicate rows
and reindexed fine.  This is just a heads up with all the detail I can fit in a
mail (but there's more if needed).

ts=# \d sitessite_id          | integer  |           | not null | nextval('sites_site_id_seq'::regclass)site_office
| text     |           |          | site_location    | text     |           |          | 
 
[...]
Indexes:   "sites_pkey" PRIMARY KEY, btree (site_id)   "sites_idx" UNIQUE, btree (site_office, site_location)

ts=# SELECT site_office, site_location, count(*), min(site_id), max(site_id) FROM sites GROUP BY 1,2 HAVING count(*)>1
ORDERBY 1,2;   site_office   | site_location | count | min | max 
 
----------------+---------------+-------+-----+-----CRCLMT-DOEMS0  |               |     2 | 165 | 351CRCLMT-DOEMS0  |
1101         |     2 | 123 | 343CRCLMT-DOEMS0  | 1102          |     2 | 134 | 318CRCLMT-DOEMS0  | 1103          |
2| 145 | 322CRCLMT-DOEMS0  | 1104          |     2 | 156 | 329
 

The duplicate site_ids mean this isn't an issue with row version/visibility due
to XIDs (right?).

ts=# SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND site_location='';
(0 rows)
ts=# SET enable_bitmapscan=off; SET enable_indexscan=off; SELECT 1 FROM sites WHERE site_office='CRCLMT-CEMS0' AND
site_location='';
-[ RECORD 1 ]
?column? | 1

So there's an issue with indices failing to return matching rows (and thereby
allowing inserting duplicate violating rows).

That's the only table/index affected (and the only PG instance thus far
affected).

Note regarding my pg_upgrade: 3 years ago, this was our first and smallest
customer who I upgraded off PG8.4 (to PG9.2 if I recall), and I did it using
pg_dump |pg_restore.  I believe, as a consequence, its postgres database was in
"C" locale and ASCII encoding.  So the last few upgrades (9.3, .4 .5 and .6), I
believe I've manually used initdb --encoding followed by pg_upgrade (else it
fails due to new postgres/template DBs with different locale/encoding from
old).  This upgrade, I finally renamed postgres DB (which has imported CSV logs
and one or two other things) and pg_dump|pg_restore into a new DB with UTF8
encoding, which allowed pg_upgrade to run without special initdb invocation.

I have an LVM snapshot and full CSV logs imported into a table.  I also have a
backup from 22:00 which doesn't have duplicate sites.  Those seem to have been
inserted by our application around 00:30:

These IDs which inserted duplicate rows:
postgres=# SELECT session_id, max(session_line) FROM postgres_log_2017_10_24_0000 WHERE message LIKE 'statement: SELECT
site_idFROM sites WHERE%' GROUP BY 1 ; session_id   | max  
 
---------------+------59eedfb1.5cea |  71459eedfb5.5cf1 | 1741
(2 rows)

postgres=# SELECT log_time, session_id, session_line, left(message,333) FROM postgres_log WHERE
(session_id='59eedfb1.5cea'OR session_id='59eedfb5.5cf1') AND (session_line<6 OR message LIKE '%INSERT INTO site%')
ORDERBY 1,2,3;
 
-[ RECORD 4
]+----------------------------------------------------------------------------------------------------------------------------------------------
log_time     | 2017-10-24 00:37:37.888-06
session_id   | 59eedfb1.5cea
session_line | 4
left         | statement: SELECT site_id FROM sites WHERE
                                   +            |                 site_office = 'CRCLMT-DOEMS0' AND site_location =
'1203'
-[ RECORD 5
]+----------------------------------------------------------------------------------------------------------------------------------------------
log_time     | 2017-10-24 00:37:37.89-06
session_id   | 59eedfb1.5cea
session_line | 5
left         | statement: INSERT INTO sites (site_office,site_location,site_alias)
                                   +            |                         VALUES ('CRCLMT-DOEMS0', '1203', (SELECT
site_idFROM sites                                                          +            |                         WHERE
site_office= 'CRCLMT-CEMS0' AND site_location = '1203'))
 

Note:
I run a script which does various combinations of ANALYZE/VACUUM (FULL/ANALYZE)
following the upgrade, and a script runs nightly with REINDEX and pg_repack
(and a couple of CLUSTER), so you should assume that any combination of those
maintenance commands have been run.

In our reindex/repack log I found the first error due to duplicates:
Tue Oct 24 01:27:53 MDT 2017: sites: sites_idx(repack non-partitioned)...
WARNING: Error creating index "public"."index_61764": ERROR:  could not create unique index "index_61764"
DETAIL:  Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
WARNING: Skipping index swapping for "sites", since no new indexes built
WARNING: repack failed for "sites_idx"
reindex: warning, dropping invalid/unswapped index: index_61764

postgres=# SELECT * FROM postgres_log WHERE session_id='59eeeb79.4bd5' AND error_severity='ERROR' ORDER BY 1 DESC LIMIT
33;
log_time               | 2017-10-24 01:27:53.545-06
user_name              | postgres
database               | ts
pid                    | 19413
connection_from        | [local]
session_id             | 59eeeb79.4bd5
session_line           | 10
command_tag            | CREATE INDEX
session_start_time     | 2017-10-24 01:27:53-06
error_severity         | ERROR
message                | could not create unique index "index_61764"
detail                 | Key (site_office, site_location)=(CRCLMT-DOEMS0, 1120) is duplicated.
query                  | CREATE UNIQUE INDEX CONCURRENTLY index_61764 ON sites USING btree (site_office, site_location)
TABLESPACEpg_default
 
application_name       | pg_repack

Disclosure: I see a storage error from 2 days ago (pre-upgrade).  This is a
4-drive RAID5 with spare.  I can't see that we're missing a spare, so I don't
know what else to do besides consider this message to be spurious..

Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825577] hpsa 0000:05:00.0: scsi 4:0:0:0 Aborting command
ffff8f84f2ba7c00Tag:0x00000000:000001f0CDBLen: 6 CDB: 0x1201... SN: 0x0  BEING SENT
 
Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.825583] hpsa 0000:05:00.0: scsi 4:0:0:0: Aborting command RAID
     HP       P410i            controller SSDSmartPathCap- En- Exp=1
 
Oct 22 21:28:16 midrivers-ubuntu kernel: [1429369.898930] hpsa 0000:05:00.0: CDB 12018000600000000000000000000000 was
abortedwith status 0x0
 
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269963] hpsa 0000:05:00.0: invalid command: LUN:0000000000000000
CDB:00000000f00100000000000000000000
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269966] hpsa 0000:05:00.0: probably means device no longer present
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269968] hpsa 0000:05:00.0: scsi 4:0:0:0 Aborting command
ffff8f84f2ba7c00Tag:0x00000000:000001f0CDBLen: 6 CDB: 0x1201... SN: 0x0  SENT, FAILED
 
Oct 22 21:28:17 midrivers-ubuntu kernel: [1429371.269974] hpsa 0000:05:00.0: scsi 4:0:0:0: FAILED to abort command RAID
            HP       P410i            controller SSDSmartPathCap- En- Exp=1
 

sites is perhaps our most central table and I would expect issues there to be
quickly apparent (however this was at least initially a silent failure).public | sites | table | telsasoft | 80 kB | 

Justin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Konstantin Knizhnik
Дата:
Сообщение: [HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION
Следующее
От: Robert Haas
Дата:
Сообщение: Re: [HACKERS] Current int & float overflow checking is slow.