I think we experienced something similar.
Now a few words about our setup:
- AWS, i3.8xlarge
- Ubuntu 18.04
- ext4
- It is a shared database, with 8 clusters in total
- Size of each cluster ~1TB
- Each cluster produces ~3TB of WAL every day (plenty of UPDATEs, about 90% of which are HOT updates).
Corruption was found on all shards, but the list of affected indexes varies a bit from shard to shard.
Database schema:
- mostly PRIMARY or UNIQUE keys
- a couple of non-unique btree indexes
- plenty of foreign keys
The timeline:
2021-10-11 - we did the major upgrade from 9.6 to 14
2021-10-14 - executed reindexdb -a --concurrently, which finished successfully. To speed up reindexing we used PGOPTIONS="-c maintenance_work_mem=64GB -c max_parallel_maintenance_workers=4"
2021-10-25 - I noticed that some of the indexes were corrupted; these were mostly UNIQUE indexes on int and/or bigint columns.
After that, I identified the affected indexes with amcheck, found and removed duplicated rows, and ran pg_repack on the affected tables. pg_repack was run with max_parallel_maintenance_workers=0
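For reference, the checks above can be sketched roughly as follows (a minimal SQL sketch, not the exact queries I ran; "my_table", "id", and the schema filter are placeholders):

```sql
-- Check all btree indexes in the current database with amcheck
-- (bt_index_check with heapallindexed also verifies that every
-- heap tuple is present in the index; available since PG 11).
CREATE EXTENSION IF NOT EXISTS amcheck;

SELECT c.relname,
       bt_index_check(index => c.oid, heapallindexed => true)
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i'
  AND am.amname = 'btree'
  AND c.relnamespace = 'public'::regnamespace;  -- placeholder schema

-- Find duplicate rows that violate a supposedly UNIQUE key,
-- e.g. a unique index on my_table(id):
SELECT id, count(*), array_agg(ctid)
FROM my_table
GROUP BY id
HAVING count(*) > 1;
```

Duplicate detection has to scan the heap (as above) rather than use the index, since the corrupted unique index is exactly what failed to enforce uniqueness.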
Since we keep WAL archives and backups only for the past 6 days, it is no longer possible to find the respective files that produced the corruption.
As of today (2021-10-29), amcheck doesn't report any problems.
I hope this information gives you some hints.
Regards,
--
Alexander Kukushkin