Обсуждение: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

Поиск
Список
Период
Сортировка

Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Kristjan Mustkivi
Дата:
Dear community,

Right after upgrading our postgres servers from 11.15 to 11.17 we
started to encounter problems with data. Namely, when the query hit
the index, it returned a single row; when the query hit a relation
directly, it returned more than one row. Attempt to REINDEX revealed
the underlying data had duplicates (unique index reindexing failed).

Version facts:
we started out with 11.12
jan 2022 upgraded to 11.14
mar 2022 to 11.15
oct 2022 to 11.17

We are not sure when this corruption actually happened. Could it be
related to the indexing bugs reported in
https://www.postgresql.org/docs/release/11.14/? And the condition only
became known to us after 11.17 rollout which can perhaps be explained
by the following: while 11.17 does not have any outstanding index
related fixes, then https://www.postgresql.org/docs/release/11.15/
mentions fix for index-only scans and so does
https://www.postgresql.org/docs/release/11.16/.

The bottom line is we would like to understand if the index corruption
and its manifestation is explained by the above release fixes or is
there something else that should be investigated further here with the
help from the community.

With best regards,
-- 
Kristjan Mustkivi

Email: kristjan.mustkivi@gmail.com



Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Allan Kamau
Дата:


On Thu, Oct 27, 2022 at 10:20 AM Kristjan Mustkivi <sonicmonkey@gmail.com> wrote:
Dear community,

Right after upgrading our postgres servers from 11.15 to 11.17 we
started to encounter problems with data. Namely, when the query hit
the index, it returned a single row; when the query hit a relation
directly, it returned more than one row. Attempt to REINDEX revealed
the underlying data had duplicates (unique index reindexing failed).

Version facts:
we started out with 11.12
jan 2022 upgraded to 11.14
mar 2022 to 11.15
oct 2022 to 11.17

We are not sure when this corruption actually happened. Could it be
related to the indexing bugs reported in
https://www.postgresql.org/docs/release/11.14/? And the condition only
became known to us after 11.17 rollout which can perhaps be explained
by the following: while 11.17 does not have any outstanding index
related fixes, then https://www.postgresql.org/docs/release/11.15/
mentions fix for index-only scans and so does
https://www.postgresql.org/docs/release/11.16/.

The bottom line is we would like to understand if the index corruption
and its manifestation is explained by the above release fixes or is
there something else that should be investigated further here with the
help from the community.

With best regards,
--
Kristjan Mustkivi

Email: kristjan.mustkivi@gmail.com


Hi Kristjan,
What if you construct a select statement containing the row id and the column which has the problematic index into a new table. Then perform queries on this table to test for uniqueness of the column on which the problematic index was reported.

Allan. 

Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Allan Kamau
Дата:


On Thu, Oct 27, 2022 at 10:26 AM Allan Kamau <kamauallan@gmail.com> wrote:


On Thu, Oct 27, 2022 at 10:20 AM Kristjan Mustkivi <sonicmonkey@gmail.com> wrote:
Dear community,

Right after upgrading our postgres servers from 11.15 to 11.17 we
started to encounter problems with data. Namely, when the query hit
the index, it returned a single row; when the query hit a relation
directly, it returned more than one row. Attempt to REINDEX revealed
the underlying data had duplicates (unique index reindexing failed).

Version facts:
we started out with 11.12
jan 2022 upgraded to 11.14
mar 2022 to 11.15
oct 2022 to 11.17

We are not sure when this corruption actually happened. Could it be
related to the indexing bugs reported in
https://www.postgresql.org/docs/release/11.14/? And the condition only
became known to us after 11.17 rollout which can perhaps be explained
by the following: while 11.17 does not have any outstanding index
related fixes, then https://www.postgresql.org/docs/release/11.15/
mentions fix for index-only scans and so does
https://www.postgresql.org/docs/release/11.16/.

The bottom line is we would like to understand if the index corruption
and its manifestation is explained by the above release fixes or is
there something else that should be investigated further here with the
help from the community.

With best regards,
--
Kristjan Mustkivi

Email: kristjan.mustkivi@gmail.com


Hi Kristjan,
What if you construct a select statement containing the row id and the column which has the problematic index into a new table. Then perform queries on this table to test for uniqueness of the column on which the problematic index was reported.

Allan. 

How was the data "transfer" between upgrades done? Was it by dump and restore?
If you have the 11.15 instance running having the data, you may do the selection of the row id and the specific column which the index is based into a new table and perform queries on this too to determine uniqueness of the values therein. Likewise do the same for the 11.17 version.

Is it possible to build and install PG 15 from source on a different directory (using --prefix ) then perform pg_dump using the binaries of this installation into a directory. Then configure PG 15 installation to listen on a different TCP/IP port to the one you are currently using with 11.17 instance. Once started, test to see if the index anomaly is present in the PG 15 instance. Alternatively you may use the PG 15 docker image and docker to start a PG 15 docker container for your tests instead of having to build and install PG 15 for this test.

-Allan



Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Kristjan Mustkivi
Дата:
Hi Allan,

We use dockerized postgres. So the upgrade is just replacing the old
minor version pg docker image with new minor version docker image and
that's it. Of course, I am checking the release notes to see if there
is anything to pay attention to particularly.

I do apologize, but I do not understand the value of doing that select
juggling. I have been searching for and fixing the problematic rows
with ctid (and xmin, xmax  to help establish the ones to remove) and
this has been effective in indicating the discrepancy between actual
data in the table and the corruption in the e.g Primary Key index.
Also, playing around with enable_indexscan, enable_bitmapscan,
enable_seqscan prove the same problem with an index. After deleting
the offending rows by ctid, REINDEX-ing is possible. (As these are
production systems, some of the relations had to be fixed right away.)

This case is most likely to do with some bug and the release notes for
11.14, .15 and .16 seem to explain the current situation. What would
be good to know is that this particular bug has been fixed and is not
one yet to be uncovered.

Best regards,

Kristjan

On Thu, Oct 27, 2022 at 10:41 AM Allan Kamau <kamauallan@gmail.com> wrote:
>
>
>
> On Thu, Oct 27, 2022 at 10:26 AM Allan Kamau <kamauallan@gmail.com> wrote:
>>
>>
>>
>> On Thu, Oct 27, 2022 at 10:20 AM Kristjan Mustkivi <sonicmonkey@gmail.com> wrote:
>>>
>>> Dear community,
>>>
>>> Right after upgrading our postgres servers from 11.15 to 11.17 we
>>> started to encounter problems with data. Namely, when the query hit
>>> the index, it returned a single row; when the query hit a relation
>>> directly, it returned more than one row. Attempt to REINDEX revealed
>>> the underlying data had duplicates (unique index reindexing failed).
>>>
>>> Version facts:
>>> we started out with 11.12
>>> jan 2022 upgraded to 11.14
>>> mar 2022 to 11.15
>>> oct 2022 to 11.17
>>>
>>> We are not sure when this corruption actually happened. Could it be
>>> related to the indexing bugs reported in
>>> https://www.postgresql.org/docs/release/11.14/? And the condition only
>>> became known to us after 11.17 rollout which can perhaps be explained
>>> by the following: while 11.17 does not have any outstanding index
>>> related fixes, then https://www.postgresql.org/docs/release/11.15/
>>> mentions fix for index-only scans and so does
>>> https://www.postgresql.org/docs/release/11.16/.
>>>
>>> The bottom line is we would like to understand if the index corruption
>>> and its manifestation is explained by the above release fixes or is
>>> there something else that should be investigated further here with the
>>> help from the community.
>>>
>>> With best regards,
>>> --
>>> Kristjan Mustkivi
>>>
>>> Email: kristjan.mustkivi@gmail.com
>>>
>>>
>> Hi Kristjan,
>> What if you construct a select statement containing the row id and the column which has the problematic index into a
newtable. Then perform queries on this table to test for uniqueness of the column on which the problematic index was
reported.
>>
>> Allan.
>
>
> How was the data "transfer" between upgrades done? Was it by dump and restore?
> If you have the 11.15 instance running having the data, you may do the selection of the row id and the specific
columnwhich the index is based into a new table and perform queries on this too to determine uniqueness of the values
therein.Likewise do the same for the 11.17 version. 
>
> Is it possible to build and install PG 15 from source on a different directory (using --prefix ) then perform pg_dump
usingthe binaries of this installation into a directory. Then configure PG 15 installation to listen on a different
TCP/IPport to the one you are currently using with 11.17 instance. Once started, test to see if the index anomaly is
presentin the PG 15 instance. Alternatively you may use the PG 15 docker image and docker to start a PG 15 docker
containerfor your tests instead of having to build and install PG 15 for this test. 
>
> -Allan
>
>
>


--
Kristjan Mustkivi

Email: kristjan.mustkivi@gmail.com



Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
"Peter J. Holzer"
Дата:
On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote:
> We use dockerized postgres.

So that means you aren't just replacing PostgreSQL, but your complete OS
(except the kernel). What is the source of your docker images? Do they
all use the same base OS distribution? Are the locale definitions the
same?

(Just trying to rule other other possible error sources.)


> I do apologize, but I do not understand the value of doing that select
> juggling.

I think Allan may have misread your mail.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Allan Kamau
Дата:


On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote:
> We use dockerized postgres.

So that means you aren't just replacing PostgreSQL, but your complete OS
(except the kernel). What is the source of your docker images? Do they
all use the same base OS distribution? Are the locale definitions the
same?

(Just trying to rule other other possible error sources.)


> I do apologize, but I do not understand the value of doing that select
> juggling.

I think Allan may have misread your mail.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"



Hi Peter,

It was a long shot. I was trying to suggest determining if indeed the data does actually contain duplicating values in the mentioned columns. 
Then test to see if the index construction on the new table can be successful on both versions and see if the error is reproducible.

-Allan.

Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Kristjan Mustkivi
Дата:
On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
> On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote:
> > We use dockerized postgres.
>
> So that means you aren't just replacing PostgreSQL, but your complete OS
> (except the kernel). What is the source of your docker images? Do they
> all use the same base OS distribution? Are the locale definitions the
> same?
>
> (Just trying to rule other other possible error sources.)

Hello!

Up until 11.17, the source of the docker images was tag "postgres:11"
(from https://hub.docker.com/_/postgres), for 11.17 the tag became
"postgres:11-bullseye" but as far as i could tell it was just a
difference of tagging policy there. Everything else is kept the same
when building our custom docker image (with pg_cron, wal2json and
oracle_fdw). But.. I can see for example, that the PG 11.12 docker
image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to
/etc/debian_version). So the official docker images also upgrade the
OS (which is expected I suppose).

Best regards,
-- 
Kristjan Mustkivi

Email: kristjan.mustkivi@gmail.com



Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Ron
Дата:
On 10/27/22 07:07, Kristjan Mustkivi wrote:
> On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>> On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote:
>>> We use dockerized postgres.
>> So that means you aren't just replacing PostgreSQL, but your complete OS
>> (except the kernel). What is the source of your docker images? Do they
>> all use the same base OS distribution? Are the locale definitions the
>> same?
>>
>> (Just trying to rule other other possible error sources.)
> Hello!
>
> Up until 11.17, the source of the docker images was tag "postgres:11"
> (from https://hub.docker.com/_/postgres), for 11.17 the tag became
> "postgres:11-bullseye" but as far as i could tell it was just a
> difference of tagging policy there. Everything else is kept the same
> when building our custom docker image (with pg_cron, wal2json and
> oracle_fdw). But.. I can see for example, that the PG 11.12 docker
> image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to
> /etc/debian_version). So the official docker images also upgrade the
> OS (which is expected I suppose).

Differing locales is a strong possibility.

-- 
Angular momentum makes the world go 'round.



Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
"Peter J. Holzer"
Дата:
On 2022-10-27 15:07:06 +0300, Kristjan Mustkivi wrote:
> On Thu, Oct 27, 2022 at 12:18 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2022-10-27 10:55:31 +0300, Kristjan Mustkivi wrote:
> > > We use dockerized postgres.
> >
> > So that means you aren't just replacing PostgreSQL, but your complete OS
> > (except the kernel). What is the source of your docker images? Do they
> > all use the same base OS distribution? Are the locale definitions the
> > same?
> >
> > (Just trying to rule other other possible error sources.)
>
> Up until 11.17, the source of the docker images was tag "postgres:11"
> (from https://hub.docker.com/_/postgres), for 11.17 the tag became
> "postgres:11-bullseye" but as far as i could tell it was just a
> difference of tagging policy there. Everything else is kept the same
> when building our custom docker image (with pg_cron, wal2json and
> oracle_fdw). But.. I can see for example, that the PG 11.12 docker
> image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to
> /etc/debian_version).

Ok, So that's an ugrade from Debian 9 to Debian 11. That's definitely
not just a "difference of tagging policy", That's two major versions of
the OS!

I don't remember exactly when the big incompatible libc upgrade was, but
it was very likely somewhere between Debian 9 and Debian 11, so you have
to rebuild all you indexes. Since you didn't do that immediately after
the upgrade you now have data corruption which you have to fix manually.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Kristjan Mustkivi
Дата:
On Fri, Oct 28, 2022 at 2:41 AM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
> > Up until 11.17, the source of the docker images was tag "postgres:11"
> > (from https://hub.docker.com/_/postgres), for 11.17 the tag became
> > "postgres:11-bullseye" but as far as i could tell it was just a
> > difference of tagging policy there. Everything else is kept the same
> > when building our custom docker image (with pg_cron, wal2json and
> > oracle_fdw). But.. I can see for example, that the PG 11.12 docker
> > image used Debian 9.13 (PG 11.17 uses Debian 11.5 according to
> > /etc/debian_version).
>
> Ok, So that's an ugrade from Debian 9 to Debian 11. That's definitely
> not just a "difference of tagging policy", That's two major versions of
> the OS!
>
> I don't remember exactly when the big incompatible libc upgrade was, but
> it was very likely somewhere between Debian 9 and Debian 11, so you have
> to rebuild all you indexes. Since you didn't do that immediately after
> the upgrade you now have data corruption which you have to fix manually.

Well, I am going to remember this lesson for the rest of my life.
After using this same scheme for upgrading the pg-s for over 3 years,
this really caught me as a surprise - apparently I got comfortable and
I did not pay close attention to this particular change with the
docker hub images.

We have reverted back to our previous version as the comparison based
on queries showed the data to be more intact with it.

By the way, index rebuild while completing successfully did not fix
the indexes - the data in the tables was still missing even after the
successful rebuild command. I guess the only option left is to drop
and re-create these one by one.

Thank you all for feedback and help!

With best regards,
-- 
Kristjan Mustkivi

Email: kristjan.mustkivi@gmail.com



Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
"Peter J. Holzer"
Дата:
On 2022-10-28 17:23:02 +0300, Kristjan Mustkivi wrote:
> By the way, index rebuild while completing successfully did not fix
> the indexes - the data in the tables was still missing even after the
> successful rebuild command.

This is surprising. As I understand it, REINDEX scans the complete table
and builds a new index. So after a REINDEX the index should be complete
and usable.

However, reading
https://www.postgresql.org/docs/current/sql-reindex.html I'm unsure what
happens if you have issued a REINDEX for a table, schema, etc. and
rebuilding an index fails: Does it abort at that time or does it just
issue a notice and continue to rebuild the other indexes? In the latter
case it migh be easy to miss a problem.

        hp


--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Вложения

Re: Index corruption revealed after upgrade to 11.17, could date back to at least 11.12

От
Joe Conway
Дата:
On 10/28/22 16:24, Peter J. Holzer wrote:
> On 2022-10-28 17:23:02 +0300, Kristjan Mustkivi wrote:
>> By the way, index rebuild while completing successfully did not fix
>> the indexes - the data in the tables was still missing even after the
>> successful rebuild command.
> 
> This is surprising. As I understand it, REINDEX scans the complete table
> and builds a new index. So after a REINDEX the index should be complete
> and usable.
> 
> However, reading
> https://www.postgresql.org/docs/current/sql-reindex.html I'm unsure what
> happens if you have issued a REINDEX for a table, schema, etc. and
> rebuilding an index fails: Does it abort at that time or does it just
> issue a notice and continue to rebuild the other indexes? In the latter
> case it migh be easy to miss a problem.

Debian 9 appears to be glibc 2.24 and Debian 11 is glibc 2.31. Therefore 
the issue here is almost certainly corrupt indexes due to the glibc changes.

Note that some forms of the problem can manifest in primary keys or 
unique indexes that were valid in the original collation behavior, but 
contained duplicate rows according to the new collation. Those indexes 
would not be able to be reindexed without manual intervention.

I don't know offhand what happens if you concurrently reindex an entire 
database and just a few indexes fail to rebuild, but I suspect as you 
mentioned above the failure might easily be missed and invalid indexes 
would be left behind. I think you will find both indisready and 
indisvalid false for these indexes.

HTH,

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com