Обсуждение: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
От
Kristjan Mustkivi
Дата:
Dear community, Right after upgrading our postgres servers from 11.15 to 11.17 we started to encounter problems with data. Namely, when the query hit the index, it returned a single row; when the query hit a relation directly, it returned more than one row. Attempt to REINDEX revealed the underlying data had duplicates (unique index reindexing failed). Version facts: we started out with 11.12 jan 2022 upgraded to 11.14 mar 2022 to 11.15 oct 2022 to 11.17 We are not sure when this corruption actually happened. Could it be related to the indexing bugs reported in https://www.postgresql.org/docs/release/11.14/? And the condition only became known to us after 11.17 rollout which can perhaps be explained by the following: while 11.17 does not have any outstanding index related fixes, then https://www.postgresql.org/docs/release/11.15/ mentions fix for index-only scans and so does https://www.postgresql.org/docs/release/11.16/. The bottom line is we would like to understand if the index corruption and its manifestation is explained by the above release fixes or is there something else that should be investigated further here with the help from the community. With best regards, -- Kristjan Mustkivi Email: kristjan.mustkivi@gmail.com
On Thu, Oct 27, 2022 at 10:20 AM Kristjan Mustkivi <sonicmonkey@gmail.com> wrote:
Dear community,
Right after upgrading our postgres servers from 11.15 to 11.17 we
started to encounter problems with data. Namely, when the query hit
the index, it returned a single row; when the query hit a relation
directly, it returned more than one row. Attempt to REINDEX revealed
the underlying data had duplicates (unique index reindexing failed).
Version facts:
we started out with 11.12
jan 2022 upgraded to 11.14
mar 2022 to 11.15
oct 2022 to 11.17
We are not sure when this corruption actually happened. Could it be
related to the indexing bugs reported in
https://www.postgresql.org/docs/release/11.14/? And the condition only
became known to us after 11.17 rollout which can perhaps be explained
by the following: while 11.17 does not have any outstanding index
related fixes, then https://www.postgresql.org/docs/release/11.15/
mentions fix for index-only scans and so does
https://www.postgresql.org/docs/release/11.16/.
The bottom line is we would like to understand if the index corruption
and its manifestation is explained by the above release fixes or is
there something else that should be investigated further here with the
help from the community.
With best regards,
--
Kristjan Mustkivi
Email: kristjan.mustkivi@gmail.com
Hi Kristjan,
What if you construct a select statement containing the row id and the column which has the problematic index into a new table. Then perform queries on this table to test for uniqueness of the column on which the problematic index was reported.
Allan.
On Thu, Oct 27, 2022 at 10:26 AM Allan Kamau <kamauallan@gmail.com> wrote:
On Thu, Oct 27, 2022 at 10:20 AM Kristjan Mustkivi <sonicmonkey@gmail.com> wrote:Dear community,
Right after upgrading our postgres servers from 11.15 to 11.17 we
started to encounter problems with data. Namely, when the query hit
the index, it returned a single row; when the query hit a relation
directly, it returned more than one row. Attempt to REINDEX revealed
the underlying data had duplicates (unique index reindexing failed).
Version facts:
we started out with 11.12
jan 2022 upgraded to 11.14
mar 2022 to 11.15
oct 2022 to 11.17
We are not sure when this corruption actually happened. Could it be
related to the indexing bugs reported in
https://www.postgresql.org/docs/release/11.14/? And the condition only
became known to us after 11.17 rollout which can perhaps be explained
by the following: while 11.17 does not have any outstanding index
related fixes, then https://www.postgresql.org/docs/release/11.15/
mentions fix for index-only scans and so does
https://www.postgresql.org/docs/release/11.16/.
The bottom line is we would like to understand if the index corruption
and its manifestation is explained by the above release fixes or is
there something else that should be investigated further here with the
help from the community.
With best regards,
--
Kristjan Mustkivi
Email: kristjan.mustkivi@gmail.comHi Kristjan,What if you construct a select statement containing the row id and the column which has the problematic index into a new table. Then perform queries on this table to test for uniqueness of the column on which the problematic index was reported.Allan.
How was the data "transfer" between upgrades done? Was it by dump and restore?
If you have the 11.15 instance running having the data, you may do the selection of the row id and the specific column which the index is based into a new table and perform queries on this too to determine uniqueness of the values therein. Likewise do the same for the 11.17 version.
Is it possible to build and install PG 15 from source on a different directory (using --prefix ) then perform pg_dump using the binaries of this installation into a directory. Then configure PG 15 installation to listen on a different TCP/IP port to the one you are currently using with 11.17 instance. Once started, test to see if the index anomaly is present in the PG 15 instance. Alternatively you may use the PG 15 docker image and docker to start a PG 15 docker container for your tests instead of having to build and install PG 15 for this test.
-Allan
Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
От
Kristjan Mustkivi
Дата:
Hi Allan, We use dockerized postgres. So the upgrade is just replacing the old minor version pg docker image with new minor version docker image and that's it. Of course, I am checking the release notes to see if there is anything to pay attention to particularly. I do apologize, but I do not understand the value of doing that select juggling. I have been searching for and fixing the problematic rows with ctid (and xmin, xmax to help establish the ones to remove) and this has been effective in indicating the discrepancy between actual data in the table and the corruption in the e.g Primary Key index. Also, playing around with enable_indexscan, enable_bitmapscan, enable_seqscan prove the same problem with an index. After deleting the offending rows by ctid, REINDEX-ing is possible. (As these are production systems, some of the relations had to be fixed right away.) This case is most likely to do with some bug and the release notes for 11.14, .15 and .16 seem to explain the current situation. What would be good to know is that this particular bug has been fixed and is not one yet to be uncovered. Best regards, Kristjan On Thu, Oct 27, 2022 at 10:41 AM Allan Kamau <kamauallan@gmail.com> wrote: > > > > On Thu, Oct 27, 2022 at 10:26 AM Allan Kamau <kamauallan@gmail.com> wrote: >> >> >> >> On Thu, Oct 27, 2022 at 10:20 AM Kristjan Mustkivi <sonicmonkey@gmail.com> wrote: >>> >>> Dear community, >>> >>> Right after upgrading our postgres servers from 11.15 to 11.17 we >>> started to encounter problems with data. Namely, when the query hit >>> the index, it returned a single row; when the query hit a relation >>> directly, it returned more than one row. Attempt to REINDEX revealed >>> the underlying data had duplicates (unique index reindexing failed). >>> >>> Version facts: >>> we started out with 11.12 >>> jan 2022 upgraded to 11.14 >>> mar 2022 to 11.15 >>> oct 2022 to 11.17 >>> >>> We are not sure when this corruption actually happened. Could it be >>> related to the indexing bugs reported in >>> https://www.postgresql.org/docs/release/11.14/? And the condition only >>> became known to us after 11.17 rollout which can perhaps be explained >>> by the following: while 11.17 does not have any outstanding index >>> related fixes, then https://www.postgresql.org/docs/release/11.15/ >>> mentions fix for index-only scans and so does >>> https://www.postgresql.org/docs/release/11.16/. >>> >>> The bottom line is we would like to understand if the index corruption >>> and its manifestation is explained by the above release fixes or is >>> there something else that should be investigated further here with the >>> help from the community. >>> >>> With best regards, >>> -- >>> Kristjan Mustkivi >>> >>> Email: kristjan.mustkivi@gmail.com >>> >>> >> Hi Kristjan, >> What if you construct a select statement containing the row id and the column which has the problematic index into a newtable. Then perform queries on this table to test for uniqueness of the column on which the problematic index was reported. >> >> Allan. > > > How was the data "transfer" between upgrades done? Was it by dump and restore? > If you have the 11.15 instance running having the data, you may do the selection of the row id and the specific columnwhich the index is based into a new table and perform queries on this too to determine uniqueness of the values therein.Likewise do the same for the 11.17 version. > > Is it possible to build and install PG 15 from source on a different directory (using --prefix ) then perform pg_dump usingthe binaries of this installation into a directory. Then configure PG 15 installation to listen on a different TCP/IPport to the one you are currently using with 11.17 instance. Once started, test to see if the index anomaly is presentin the PG 15 instance. Alternatively you may use the PG 15 docker image and docker to start a PG 15 docker containerfor your tests instead of having to build and install PG 15 for this test. > > -Allan > > > -- Kristjan Mustkivi Email: kristjan.mustkivi@gmail.com
Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
От
"Peter J. Holzer"
Дата:
On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote: > We use dockerized postgres. So that means you aren't just replacing PostgreSQL, but your complete OS (except the kernel). What is the source of your docker images? Do they all use the same base OS distribution? Are the locale definitions the same? (Just trying to rule other other possible error sources.) > I do apologize, but I do not understand the value of doing that select > juggling. I think Allan may have misread your mail. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Вложения
On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote:
> We use dockerized postgres.
So that means you aren't just replacing PostgreSQL, but your complete OS
(except the kernel). What is the source of your docker images? Do they
all use the same base OS distribution? Are the locale definitions the
same?
(Just trying to rule other other possible error sources.)
> I do apologize, but I do not understand the value of doing that select
> juggling.
I think Allan may have misread your mail.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Hi Peter,
It was a long shot. I was trying to suggest determining if indeed the data does actually contain duplicating values in the mentioned columns.
Then test to see if the index construction on the new table can be successful on both versions and see if the error is reproducible.
-Allan.
Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
От
Kristjan Mustkivi
Дата:
On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote: > > On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote: > > We use dockerized postgres. > > So that means you aren't just replacing PostgreSQL, but your complete OS > (except the kernel). What is the source of your docker images? Do they > all use the same base OS distribution? Are the locale definitions the > same? > > (Just trying to rule other other possible error sources.) Hello! Up until 11.17, the source of the docker images was tag "postgres:11" (from https://hub.docker.com/_/postgres), for 11.17 the tag became "postgres:11-bullseye" but as far as i could tell it was just a difference of tagging policy there. Everything else is kept the same when building our custom docker image (with pg_cron, wal2json and oracle_fdw). But.. I can see for example, that the PG 11.12 docker image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to /etc/debian_version). So the official docker images also upgrade the OS (which is expected I suppose). Best regards, -- Kristjan Mustkivi Email: kristjan.mustkivi@gmail.com
On 10/27/22 07:07, Kristjan Mustkivi wrote: > On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote: >> On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote: >>> We use dockerized postgres. >> So that means you aren't just replacing PostgreSQL, but your complete OS >> (except the kernel). What is the source of your docker images? Do they >> all use the same base OS distribution? Are the locale definitions the >> same? >> >> (Just trying to rule other other possible error sources.) > Hello! > > Up until 11.17, the source of the docker images was tag "postgres:11" > (from https://hub.docker.com/_/postgres), for 11.17 the tag became > "postgres:11-bullseye" but as far as i could tell it was just a > difference of tagging policy there. Everything else is kept the same > when building our custom docker image (with pg_cron, wal2json and > oracle_fdw). But.. I can see for example, that the PG 11.12 docker > image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to > /etc/debian_version). So the official docker images also upgrade the > OS (which is expected I suppose). Differing locales is a strong possibility. -- Angular momentum makes the world go 'round.
Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
От
"Peter J. Holzer"
Дата:
On 2022-10-27 15:07:06 +0300, Kristjan Mustkivi wrote: > On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote: > > On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote: > > > We use dockerized postgres. > > > > So that means you aren't just replacing PostgreSQL, but your complete OS > > (except the kernel). What is the source of your docker images? Do they > > all use the same base OS distribution? Are the locale definitions the > > same? > > > > (Just trying to rule other other possible error sources.) > > Up until 11.17, the source of the docker images was tag "postgres:11" > (from https://hub.docker.com/_/postgres), for 11.17 the tag became > "postgres:11-bullseye" but as far as i could tell it was just a > difference of tagging policy there. Everything else is kept the same > when building our custom docker image (with pg_cron, wal2json and > oracle_fdw). But.. I can see for example, that the PG 11.12 docker > image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to > /etc/debian_version). Ok, So that's an ugrade from Debian 9 to Debian 11. That's definitely not just a "difference of tagging policy", That's two major versions of the OS! I don't remember exactly when the big incompatible libc upgrade was, but it was very likely somewhere between Debian 9 and Debian 11, so you have to rebuild all you indexes. Since you didn't do that immediately after the upgrade you now have data corruption which you have to fix manually. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Вложения
Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
От
Kristjan Mustkivi
Дата:
On Fri, Oct 28, 2022 at 2:41 AM Peter J. Holzer <hjp-pgsql@hjp.at> wrote: > > > Up until 11.17, the source of the docker images was tag "postgres:11" > > (from https://hub.docker.com/_/postgres), for 11.17 the tag became > > "postgres:11-bullseye" but as far as i could tell it was just a > > difference of tagging policy there. Everything else is kept the same > > when building our custom docker image (with pg_cron, wal2json and > > oracle_fdw). But.. I can see for example, that the PG 11.12 docker > > image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to > > /etc/debian_version). > > Ok, So that's an ugrade from Debian 9 to Debian 11. That's definitely > not just a "difference of tagging policy", That's two major versions of > the OS! > > I don't remember exactly when the big incompatible libc upgrade was, but > it was very likely somewhere between Debian 9 and Debian 11, so you have > to rebuild all you indexes. Since you didn't do that immediately after > the upgrade you now have data corruption which you have to fix manually. Well, I am going to remember this lesson for the rest of my life. After using this same scheme for upgrading the pg-s for over 3 years, this really caught me as a surprise - apparently I got comfortable and I did not pay close attention to this particular change with the docker hub images. We have reverted back to our previous version as the comparison based on queries showed the data to be more intact with it. By the way, index rebuild while completing successfully did not fix the indexes - the data in the tables was still missing even after the successful rebuild command. I guess the only option left is to drop and re-create these one by one. Thank you all for feedback and help! With best regards, -- Kristjan Mustkivi Email: kristjan.mustkivi@gmail.com
Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12
От
"Peter J. Holzer"
Дата:
On 2022-10-28 17:23:02 +0300, Kristjan Mustkivi wrote: > By the way, index rebuild while completing successfully did not fix > the indexes - the data in the tables was still missing even after the > successful rebuild command. This is surprising. As I understand it, REINDEX scans the complete table and builds a new index. So after a REINDEX the index should be complete and usable. However, reading https://www.postgresql.org/docs/current/sql-reindex.html I'm unsure what happens if you have issued a REINDEX for a table, schema, etc. and rebuilding an index fails: Does it abort at that time or does it just issue a notice and continue to rebuild the other indexes? In the latter case it migh be easy to miss a problem. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Вложения
On 10/28/22 16:24, Peter J. Holzer wrote: > On 2022-10-28 17:23:02 +0300, Kristjan Mustkivi wrote: >> By the way, index rebuild while completing successfully did not fix >> the indexes - the data in the tables was still missing even after the >> successful rebuild command. > > This is surprising. As I understand it, REINDEX scans the complete table > and builds a new index. So after a REINDEX the index should be complete > and usable. > > However, reading > https://www.postgresql.org/docs/current/sql-reindex.html I'm unsure what > happens if you have issued a REINDEX for a table, schema, etc. and > rebuilding an index fails: Does it abort at that time or does it just > issue a notice and continue to rebuild the other indexes? In the latter > case it migh be easy to miss a problem. Debian 9 appears to be glibc 2.24 and Debian 11 is glibc 2.31. Therefore the issue here is almost certainly corrupt indexes due to the glibc changes. Note that some forms of the problem can manifest in primary keys or unique indexes that were valid in the original collation behavior, but contained duplicate rows according to the new collation. Those indexes would not be able to be reindexed without manual intervention. I don't know offhand what happens if you concurrently reindex an entire database and just a few indexes fail to rebuild, but I suspect as you mentioned above the failure might easily be missed and invalid indexes would be left behind. I think you will find both indisready and indisvalid false for these indexes. HTH, -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com