Thread: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16792
Logged by:          pawel kudzia
Email address:      kudzia@gmail.com
PostgreSQL version: 11.10
Operating system:   Debian Buster x86_64
Description:

I'd like to ask for advice on how to proceed with pinpointing the cause of
the silent GIN index corruption that I'm facing.

Relevant part of the table structure:

CREATE TABLE public.entity (
    entity_id bigint NOT NULL,
    attribute_value_ids integer[] NOT NULL,
    attribute_name_ids integer[] NOT NULL,
[..]
);

ALTER TABLE ONLY public.entity ADD CONSTRAINT entity_pkey PRIMARY KEY
(entity_id);
CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids public.gin__int_ops);
CREATE INDEX entity_attribute_value_ids_gin ON public.entity USING gin
(attribute_value_ids public.gin__int_ops);

How does the issue manifest?

Queries that use the GIN index on an integer[] column occasionally return too
many rows, including ones that do not actually match the criteria expressed in
the WHERE clause. Queries like the one below should never return any rows, yet
occasionally they do return some results:

data=> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{4980}' )
AND NOT ( (attribute_name_ids||0) && '{4980}') ;
 entity_id
-----------
  31213924
  31195117
  31209184
(3 rows)

Query plan:
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entity  (cost=50.39..3068.11 rows=26923 width=22)
   Recheck Cond: (attribute_name_ids && '{4980}'::integer[])
   Filter: (NOT ((attribute_name_ids || 0) && '{4980}'::integer[]))
   ->  Bitmap Index Scan on entity_attribute_name_ids_gin  (cost=0.00..50.39
rows=27058 width=0)
         Index Cond: (attribute_name_ids && '{4980}'::integer[])
(5 rows)

The query was crafted intentionally to let PostgreSQL use the GIN index
entity_attribute_name_ids_gin [ left side of the AND ] and then filter out rows
whose actual values do not match that criterion.
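
One way to cross-check a suspicious row against the heap, bypassing the GIN
index entirely, is to look it up by its primary key, so that the && condition
is evaluated on the stored value (a sketch reusing one of the entity_ids
returned above; zero rows here would mean the heap tuple does not actually
contain 4980):

-- heap-side check via entity_pkey; the && filter is applied to the actual
-- column value rather than derived from the GIN index
SELECT entity_id, attribute_name_ids
FROM entity
WHERE entity_id = 31213924
  AND attribute_name_ids && '{4980}';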

What have I done so far?

Originally the problem was discovered when running PostgreSQL 11.7 from
Debian's repo. As a first step we upgraded to PostgreSQL 11.9, also from
Debian's repo, and ran:

reindex (verbose) table entity; vacuum (verbose, full, analyze) entity;

After a few days of updates, the problem returned.

We've upgraded to PostgreSQL 11.10 from the postgresql.org repository for
Debian and reindexed / vacuumed again. After a few weeks of updates, the
problem returned again.

Other information worth noting:
* the table in question has 38M rows and is the only table created after
PostgreSQL installation
* the server in question is the master of a streaming replication setup; the
issue also occurs on the slave servers - this suggests corruption of the
on-disk GIN index data
* just rewriting rows falsely returned by the SELECT above - UPDATE entity SET
attribute_name_ids ='{....}' WHERE entity_id=123 - fixes the issue
* we've set up rudimentary consistency checks that allow us to catch some of
the corruptions within 1h of occurring (a minimal sketch follows this list) -
so far corruptions only happen to rows that have been updated since the
previous check
* the server in question handles heavy read/write traffic
* the PostgreSQL instance in question runs in an LXC container with Debian
Buster, running on top of Debian Buster on a bare-metal server
* it's highly unlikely that a hardware malfunction is to blame - the same
physical server also handles a few MySQL instances with hundreds of GB of data,
with heavy consistency checks cross-checking the content of MySQL against
PostgreSQL, unpacking gzip'ed blobs stored in the MySQL databases and
de-serializing the objects stored there. If there were a bit-rot / bit-flip in
memory that ECC did not detect or fix, the high-level consistency checks would
pick it up.
* despite numerous attempts I cannot reproduce the issue in a test
environment, possibly due to the much lower level of write traffic. It takes
days to weeks for it to re-occur on the production server.
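
A minimal sketch of the kind of check meant above (the attribute ids and the
exact shape of the query are assumptions for illustration; the real check
scripts are not part of this report):

-- for each attribute id of interest, any row produced here is a GIN hit
-- that the stored attribute_name_ids value does not justify
SELECT a.attr_id, e.entity_id
FROM unnest(ARRAY[471, 4980, 7572]) AS a(attr_id)
JOIN LATERAL (
    SELECT entity_id
    FROM entity
    WHERE attribute_name_ids && ARRAY[a.attr_id]
      AND NOT ((attribute_name_ids || 0) && ARRAY[a.attr_id])
) e ON true;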

Thank you in advance for your suggestions on how to tackle this.

best regards,
Pawel


Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Pawel Kudzia
Date:
On Sun, Dec 27, 2020 at 6:19 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      16792
> Logged by:          pawel kudzia
> Email address:      kudzia@gmail.com
> PostgreSQL version: 11.10
> Operating system:   Debian Buster x86_64
>
> [..]


I'd like to provide more details from another occurrence of this issue:


psql (11.7 (Debian 11.7-0+deb10u1), server 11.9 (Debian 11.9-0+deb10u1))

Example where GIN is returning misleading information:


data=# SELECT entity_id,attribute_name_ids  FROM entity WHERE (
attribute_name_ids && '{471}' ) AND NOT ( (attribute_name_ids||0) &&
'{471}') ;
 entity_id |                              attribute_name_ids
-----------+------------------------------------------------------------------------------
 197911914 | {2,3,4,6,7,8,9,10,13,17,35,48,66,71,3270,3273,3279,4377,4842,5379,5787,7416}
(1 row)


data=# EXPLAIN ANALYZE SELECT entity_id,attribute_name_ids  FROM
entity WHERE ( attribute_name_ids && '{471}' ) AND NOT (
(attribute_name_ids||0) && '{471}') ;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entity  (cost=71.04..3432.73 rows=29881
width=126) (actual time=64.179..65.255 rows=1 loops=1)
   Recheck Cond: (attribute_name_ids && '{471}'::integer[])
   Filter: (NOT ((attribute_name_ids || 0) && '{471}'::integer[]))
   Rows Removed by Filter: 8814
   Heap Blocks: exact=7877
   ->  Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..71.04 rows=30031 width=0) (actual time=10.396..10.398
rows=13628 loops=1)
         Index Cond: (attribute_name_ids && '{471}'::integer[])
 Planning Time: 0.164 ms
 Execution Time: 65.458 ms
(9 rows)



When GIN is not used, the results are as expected:

data=# EXPLAIN ANALYZE SELECT entity_id,attribute_name_ids  FROM
entity WHERE entity_id=197911914 AND attribute_name_ids && '{471}';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using entity_pkey on entity  (cost=0.00..0.20 rows=1
width=126) (actual time=0.090..0.091 rows=0 loops=1)
   Index Cond: (entity_id = 197911914)
   Filter: (attribute_name_ids && '{471}'::integer[])
   Rows Removed by Filter: 1
 Planning Time: 0.233 ms
 Execution Time: 0.116 ms
(6 rows)

data=# SELECT entity_id,attribute_name_ids  FROM entity WHERE
entity_id=197911914 AND attribute_name_ids && '{471}';
 entity_id | attribute_name_ids
-----------+--------------------
(0 rows)


From my observations, the introduction of a misleading GIN entry coincides
with an update of the database row that is later incorrectly reported.

Fixing the issue can be done by:

UPDATE entity SET attribute_name_ids ='{}' WHERE entity_id=197911914;
UPDATE entity SET attribute_name_ids
='{2,3,4,6,7,8,9,10,13,17,35,48,66,71,3270,3273,3279,4377,4842,5379,5787,7416}'
WHERE entity_id=197911914;


The table has the following structure:

        Column         |              Type              | Collation | Nullable |    Default
-----------------------+--------------------------------+-----------+----------+---------------
 entity_id             | bigint                         |           | not null |
 attribute_value_ids   | integer[]                      |           | not null |
 attribute_name_ids    | integer[]                      |           | not null |
[..]
Indexes:
    "entity_pkey" PRIMARY KEY, btree (entity_id)
    "entity_attribute_name_ids_gin" gin (attribute_name_ids gin__int_ops)
    "entity_attribute_value_ids_gin" gin (attribute_value_ids gin__int_ops)
[..]

Its disk footprint is ~60GB, ~38M rows. In fact, it's the only table
in this database.

The table gets tens of millions of updates each day and maybe 10-100k
inserts, with very few deletes.

I have taken a copy of the database files from a streaming replication slave
server. I also have WAL files from around the time that this
particular corruption occurred.

Is there any additional information I can provide?

thank you!


-- 
regards,
Pawel Kudzia



Based on the suggestions we've received on IRC we've:
* moved some of our PostgreSQL instances to another
  physical server,
* enabled data_checksums on all instances (see the
  snippet after this list),
* upgraded our servers to 13.3
  (Debian 13.3-1.pgdg100+1) in the hope that
https://github.com/postgres/postgres/commit/0d779d22a290a89b6c892137a37280b9588ad0cc
  addresses the issue [ although it's tsvector
  specific, so unlikely to be related to our issue ].
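
Whether checksums are actually active on a given instance can be confirmed
from any session (data_checksums is a read-only setting reported by the
server):

SHOW data_checksums;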

Yet soon after, we observed another inconsistency.

Recap of the problem - SELECTs return rows, based on GIN,
that do not actually meet the criteria expressed in the WHERE clause.


SELECT entity_id,attribute_name_ids FROM entity WHERE
( attribute_name_ids && '{7572}' ) AND NOT
( (attribute_name_ids||0) && '{7572}') LIMIT 100 ;


 entity_id |                                   attribute_name_ids
-----------+----------------------------------------------------------------------------------------
  22327791 | {1,2,3,4,6,8,9,10,11,13,14,17,19,21,35,72,366,1659,2208,2211,3270,3273,3279,5787,7650}

(1 row)

EXPLAIN ANALYZE of the query above:


QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.46..46.67 rows=100 width=132) (actual
time=307.221..335.842 rows=1 loops=1)
   ->  Bitmap Heap Scan on entity  (cost=35.46..3705.32 rows=32724
width=132) (actual time=307.220..335.839 rows=1 loops=1)
         Recheck Cond: (attribute_name_ids && '{7572}'::integer[])
         Rows Removed by Index Recheck: 72012
         Filter: (NOT ((attribute_name_ids || 0) && '{7572}'::integer[]))
         Rows Removed by Filter: 21501
         Heap Blocks: exact=8998 lossy=9257
         ->  Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..35.46 rows=32889 width=0) (actual time=19.790..19.790
rows=115485 loops=1)
               Index Cond: (attribute_name_ids && '{7572}'::integer[])
 Planning Time: 0.068 ms
 Execution Time: 335.879 ms
(11 rows)

Relevant part of the table structure:

CREATE TABLE public.entity (
    entity_id bigint NOT NULL,
    attribute_value_ids integer[] NOT NULL,
    attribute_name_ids integer[] NOT NULL,
[..]
);


ALTER TABLE ONLY public.entity ADD CONSTRAINT entity_pkey PRIMARY KEY
(entity_id);
CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids public.gin__int_ops);
CREATE INDEX entity_attribute_value_ids_gin ON public.entity USING gin
(attribute_value_ids public.gin__int_ops);


There are at least tens of updates / inserts per second to the table all the
time. The issue above manifests very rarely, after a few days if not weeks of
uptime. We did not find any deterministic way of reproducing it, but it's
a matter of time until it shows up.

The issue persists and is related to on-disk data; it replicates from
streaming replication masters to slaves.

Thank you in advance for your suggestions on how to tackle this.

-- 
regards,
Pawel Kudzia



with help from IRC we've found that decreasing work_mem from 1MB to 256kB
or less makes the problem go away:

data=# show work_mem;
 work_mem
----------
 256kB
(1 row)

data=# SELECT entity_id,attribute_name_ids FROM entity WHERE (
attribute_name_ids && '{7572}' ) AND NOT ( (attribute_name_ids||0) &&
'{7572}') LIMIT 100 ;
 entity_id | attribute_name_ids
-----------+--------------------
(0 rows)

data=# explain analyze SELECT entity_id,attribute_name_ids FROM entity
WHERE ( attribute_name_ids && '{7572}' ) AND NOT (
(attribute_name_ids||0) && '{7572}') LIMIT 100 ;
                                                                     QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=26.53..38.04 rows=100 width=132) (actual
time=110.013..110.015 rows=0 loops=1)
   ->  Bitmap Heap Scan on entity  (cost=26.53..3780.78 rows=32606
width=132) (actual time=110.011..110.011 rows=0 loops=1)
         Recheck Cond: (attribute_name_ids && '{7572}'::integer[])
         Rows Removed by Index Recheck: 102983
         Filter: (NOT ((attribute_name_ids || 0) && '{7572}'::integer[]))
         Rows Removed by Filter: 21501
         Heap Blocks: exact=898 lossy=13752
         ->  Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..26.53 rows=32770 width=0) (actual time=3.582..3.583
rows=21518 loops=1)
               Index Cond: (attribute_name_ids && '{7572}'::integer[])
 Planning Time: 0.173 ms
 Execution Time: 110.220 ms
(11 rows)


The problem manifests again with work_mem increased to 512kB or higher:

data=# show work_mem;
 work_mem
----------
 512kB
(1 row)

data=# SELECT entity_id,attribute_name_ids FROM entity WHERE (
attribute_name_ids && '{7572}' ) AND NOT ( (attribute_name_ids||0) &&
'{7572}') LIMIT 100 ;
 entity_id |                                   attribute_name_ids
-----------+----------------------------------------------------------------------------------------
  22327791 | {1,2,3,4,6,8,9,10,11,13,14,17,19,21,35,72,366,1659,2208,2211,3270,3273,3279,5787,7650}
(1 row)

data=# explain analyze SELECT entity_id,attribute_name_ids FROM entity
WHERE ( attribute_name_ids && '{7572}' ) AND NOT (
(attribute_name_ids||0) && '{7572}') LIMIT 100 ;
                                                                     QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=26.73..38.14 rows=100 width=132) (actual
time=112.268..119.475 rows=1 loops=1)
   ->  Bitmap Heap Scan on entity  (cost=26.73..3748.28 rows=32606
width=132) (actual time=112.267..119.473 rows=1 loops=1)
         Recheck Cond: (attribute_name_ids && '{7572}'::integer[])
         Rows Removed by Index Recheck: 68905
         Filter: (NOT ((attribute_name_ids || 0) && '{7572}'::integer[]))
         Rows Removed by Filter: 21501
         Heap Blocks: exact=5630 lossy=9012
         ->  Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..26.73 rows=32770 width=0) (actual time=3.924..3.924
rows=21518 loops=1)
               Index Cond: (attribute_name_ids && '{7572}'::integer[])
 Planning Time: 0.113 ms
 Execution Time: 119.801 ms
(11 rows)


"vacuum table entity;" did not help, neither did
"select gin_clean_pending_list('entity_attribute_name_ids_gin');"



Pawel Kudzia <kudzia@gmail.com> writes:
> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> or less makes the problem go away:

Hmm.  So that suggests that the index itself is *not* corrupt,
but the problem is associated with a bug in the indexscan
algorithms.

Have you experimented with different index opclasses?  Your
original report said you were using gin__int_ops, but that's
not the default, and we sort of deprecate intarray these days.
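
For reference, switching to the default GIN opclass for integer arrays could
look roughly like this (a sketch based on the schema shown earlier in the
thread; the _new suffix is just a placeholder, and CREATE/DROP INDEX
CONCURRENTLY cannot run inside a transaction block):

-- rebuild the overlap index with the stock array opclass instead of
-- intarray's gin__int_ops
CREATE INDEX CONCURRENTLY entity_attribute_name_ids_gin_new
    ON public.entity USING gin (attribute_name_ids);
DROP INDEX CONCURRENTLY entity_attribute_name_ids_gin;
ALTER INDEX entity_attribute_name_ids_gin_new
    RENAME TO entity_attribute_name_ids_gin;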

            regards, tom lane



I wrote:
> Pawel Kudzia <kudzia@gmail.com> writes:
>> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
>> or less makes the problem go away:

> Hmm.  So that suggests that the index itself is *not* corrupt,
> but the problem is associated with a bug in the indexscan
> algorithms.

After staring at the code I think there is at least one bug, and
possibly two, in shimTriConsistentFn.  That's likely to be implicated
here because intarray's GIN opclass only provides a bool consistent
function.  I'm not very clear on the circumstances that lead to the scan
code inventing GIN_MAYBE inputs, so I haven't been able to construct a
test case.

The definite bug is triggered because intarray relies on the API
specification that says that if the search mode is
GIN_SEARCH_MODE_DEFAULT, then the consistentFn will only be called
when there's at least one TRUE key:

        case RTOverlapStrategyNumber:
            /* result is not lossy */
            *recheck = false;
            /* at least one element in check[] is true, so result = true */
            res = true;
            break;

shimTriConsistentFn ignores this and calls it on all-FALSE inputs,
for which it'll incorrectly get a TRUE result, as it will also for
all the following checks.  The upshot is that shimTriConsistentFn
will convert any case with a MAYBE input to a hard TRUE result,
with no recheck requirement.  This'd easily explain the reported
misbehavior.  So in the spot where we do this:

    /* First call consistent function with all the maybe-inputs set FALSE */
    for (i = 0; i < nmaybe; i++)
        key->entryRes[maybeEntries[i]] = GIN_FALSE;
    curResult = directBoolConsistentFn(key);

we need to add some code that checks for default searchMode, and in
that case doesn't call the consistentFn unless at least one
(non-MAYBE) input is TRUE.

The other thing that I'm unsure about, because the code is horribly
underdocumented, is that it's not very clear whether
shimTriConsistentFn is accurately converting between the bool and
the tristate APIs.  That's because it's not very clear what the
tristate API actually *is*.  In particular, is the end state of
key->recheckCurItem meaningful in the tristate case?  If it's not,
then the short-circuit case for no MAYBE inputs is broken, because
it'll return TRUE when the bool consistentFn is trying to tell us
to recheck.  But if it is meaningful, then the multiway case is broken,
because it will return the recheckCurItem result set by the last call to
the bool consistentfn; which might be false even though other calls said
true.  (So in that case I think we'd need "key->recheckCurItem = recheck"
at the end.)  I also wonder how any of that works correctly for real
triconsistent functions, which don't have access to the recheckCurItem
flag.

Anyway, I'm punting this to the code authors.  I'd like to see
some comments clarifying what the API really is, not just a
quick-n-dirty code fix.

            regards, tom lane



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Alexander Korotkov
Date:
On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
> > Pawel Kudzia <kudzia@gmail.com> writes:
> >> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> >> or less makes the problem go away:
> [..]
>
> we need to add some code that checks for default searchMode, and in
> that case doesn't call the consistentFn unless at least one
> (non-MAYBE) input is TRUE.

 +1,
I think in default search mode we can just start with curResult equal
to GIN_FALSE instead of calling directBoolConsistentFn().

> The other thing that I'm unsure about, because the code is horribly
> underdocumented, is that it's not very clear whether
> shimTriConsistentFn is accurately converting between the bool and
> the tristate APIs.  That's because it's not very clear what the
> tristate API actually *is*.  In particular, is the end state of
> key->recheckCurItem meaningful in the tristate case?  If it's not,
> then the short-circuit case for no MAYBE inputs is broken, because
> it'll return TRUE when the bool consistentFn is trying to tell us
> to recheck.  But if it is meaningful, then the multiway case is broken,
> because it will return the recheckCurItem result set by the last call to
> the bool consistentfn; which might be false even though other calls said
> true.  (So in that case I think we'd need "key->recheckCurItem = recheck"
> at the end.)  I also wonder how any of that works correctly for real
> triconsistent functions, which don't have access to the recheckCurItem
> flag.

As far as I recall, returning MAYBE for no MAYBE inputs in the
triConsistent function is equivalent to setting the recheck flag in
the bool consistent function.  Thus, the short-circuit case for no MAYBE
inputs should be broken.

> Anyway, I'm punting this to the code authors.  I'd like to see
> some comments clarifying what the API really is, not just a
> quick-n-dirty code fix.

Yep, I'm going to have a closer look at this tomorrow.

------
Regards,
Alexander Korotkov



Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> we need to add some code that checks for default searchMode, and in
>> that case doesn't call the consistentFn unless at least one
>> (non-MAYBE) input is TRUE.

> I think in default search mode we can just start with curResult equal
> to GIN_FALSE instead of calling directBoolConsistentFn().

No, it's not that simple, because there might be other keys that are
TRUE not MAYBE.  So the result could legitimately be TRUE in this case.

BTW, I think it'd be a really good idea for this function to restore
all the MAYBE entries to GIN_MAYBE before returning, so that you can
remove the caveat that it destroys the contents of entryRes[].  I have
basically zero faith that that's safe, and it seems pretty cheap to
not have to make the assumption.

            regards, tom lane



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Alexander Korotkov
Date:
On Fri, Jun 18, 2021 at 1:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> we need to add some code that checks for default searchMode, and in
> >> that case doesn't call the consistentFn unless at least one
> >> (non-MAYBE) input is TRUE.
>
> > I think in default search mode we can just start with curResult equal
> > to GIN_FALSE instead of calling directBoolConsistentFn().
>
> No, it's not that simple, because there might be other keys that are
> TRUE not MAYBE.  So the result could legitimately be TRUE in this case.

Yes, that's it.  Thank you for the correction.

> BTW, I think it'd be a really good idea for this function to restore
> all the MAYBE entries to GIN_MAYBE before returning, so that you can
> remove the caveat that it destroys the contents of entryRes[].  I have
> basically zero faith that that's safe, and it seems pretty cheap to
> not have to make the assumption.

+1, sounds like a good idea.

------
Regards,
Alexander Korotkov



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Alexander Korotkov
Date:
On Fri, Jun 18, 2021 at 12:55 AM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Anyway, I'm punting this to the code authors.  I'd like to see
> > some comments clarifying what the API really is, not just a
> > quick-n-dirty code fix.
>
> Yep, I'm going to have a closer look at this tomorrow.

Sorry for the delay.  I'm going to take a closer look in the next
couple of days.

------
Regards,
Alexander Korotkov



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Heikki Linnakangas
Date:
On 17/06/2021 22:49, Tom Lane wrote:
> I wrote:
>> Pawel Kudzia <kudzia@gmail.com> writes:
>>> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
>>> or less makes the problem go away:
> 
>> Hmm.  So that suggests that the index itself is *not* corrupt,
>> but the problem is associated with a bug in the indexscan
>> algorithms.
> 
> After staring at the code I think there is at least one bug, and
> possibly two, in shimTriConsistentFn.  That's likely to be implicated
> here because intarray's GIN opclass only provides a bool consistent
> function.  I'm not very clear on the circumstances that lead to the scan
> code inventing GIN_MAYBE inputs, so I haven't been able to construct a
> test case.
> 
> The definite bug is triggered because intarray relies on the API
> specification that says that if the search mode is
> GIN_SEARCH_MODE_DEFAULT, then the consistentFn will only be called
> when there's at least one TRUE key:
> 
>          case RTOverlapStrategyNumber:
>              /* result is not lossy */
>              *recheck = false;
>              /* at least one element in check[] is true, so result = true */
>              res = true;
>              break;
> 
> shimTriConsistentFn ignores this and calls it on all-FALSE inputs,
> for which it'll incorrectly get a TRUE result, as it will also for
> all the following checks.  The upshot is that shimTriConsistentFn
> will convert any case with a MAYBE input to a hard TRUE result,
> with no recheck requirement.  This'd easily explain the reported
> misbehavior.

That's subtle. The documentation says:

>        If <literal>*searchMode</literal> is set to
>        <literal>GIN_SEARCH_MODE_DEFAULT</literal> (which is the value it is
>        initialized to before call), only items that match at least one of
>        the returned keys are considered candidate matches.

I guess you can read that as "consistentFn will only be called when 
there is at least one matching item", but that didn't occur to me when I 
read that at first. But I agree we need to fix shimTriConsistentFn to 
respect that.

> The other thing that I'm unsure about, because the code is horribly
> underdocumented, is that it's not very clear whether
> shimTriConsistentFn is accurately converting between the bool and
> the tristate APIs.  That's because it's not very clear what the
> tristate API actually *is*.  In particular, is the end state of
> key->recheckCurItem meaningful in the tristate case?  If it's not,
> then the short-circuit case for no MAYBE inputs is broken, because
> it'll return TRUE when the bool consistentFn is trying to tell us
> to recheck.  But if it is meaningful, then the multiway case is broken,
> because it will return the recheckCurItem result set by the last call to
> the bool consistentfn; which might be false even though other calls said
> true.  (So in that case I think we'd need "key->recheckCurItem = recheck"
> at the end.)  I also wonder how any of that works correctly for real
> triconsistent functions, which don't have access to the recheckCurItem
> flag.
> 
> Anyway, I'm punting this to the code authors.  I'd like to see
> some comments clarifying what the API really is, not just a
> quick-n-dirty code fix.

I've been trying to create a test case for this, but no luck so far. I 
cannot find a codepath that would hit these bugs with the kind of query 
that Pawel used. The search condition is very simple: "column && 
'{constant}'", with only one key and one constant to search for. There 
are three calls to triConsistentFn:

1. In startScanKey, but that's only reached if (key->nentries > 1), so 
not reachable with Pawel's query.

2. In keyGetItem, when haveLossyEntry==true. But AFAICS, haveLossyEntry 
is never true with Pawel's query. That would require curItem to be a 
lossy pointer, and that in turn can only happen when matchBitmap is 
used, and that's only used with partial match queries and with queries 
that need a full-index scan.

3. The second call in keyGetItem is reached, but it is only passed 
GIN_MAYBE when curItem is lossy. Which isn't possible with this query, 
see point 2.

AFAICS the recheckCurItem bug should also not cause wrong results with 
that query. I must be missing something.

I could probably construct a test case with a different query, one 
involving multiple elements in the search key, but I'd like to have a 
solid explanation for the original report.

Pawel, is this a production system, or would it be possible for you to 
build from sources with a patch with some extra debugging output?

- Heikki



>
> Pawel, is this a production system, or would it be possible for you to
> build from sources with a patch with some extra debugging output?
>

Thank you for looking into it!

It is a production system, but we have a couple of replicas. When the
problem occurs, it can be reproduced on any replica as well. When
it happens again, I can stop one of the replicas, take a file-level backup
and set up a test machine on which I can try to build PG from source,
including your patch adding extra verbosity.

Please note that the last inconsistency was fixed, but I expect that a
new one will show up within a few days.

Greetings!

-- 
regards,
Pawel Kudzia



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Alexander Korotkov
Date:
On Thu, Jun 24, 2021 at 3:59 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> [..]
>
> I could probably construct a test case with a different query, one
> involving multiple elements in the search key, but I'd like to have a
> solid explanation for the original report.

+1,
I see the query provided by Pawel is too simple.  The bugs that Tom dug up
in shimTriConsistentFn() don't seem to explain the wrong query results
reported by Pawel.

> Pawel, is this a production system, or would it be possible for you to
> build from sources with a patch with some extra debugging output?

Do you think it's also worth checking whether the bug persists with
fastupdate = off?  Then we could localize whether the bug is related to
pending lists.
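
A sketch of what that experiment could look like (fastupdate is a GIN storage
parameter, and gin_clean_pending_list() is the same function already tried
earlier in this thread):

-- stop new entries from going through the pending list...
ALTER INDEX entity_attribute_name_ids_gin SET (fastupdate = off);
-- ...and move anything still queued into the main index structure
SELECT gin_clean_pending_list('entity_attribute_name_ids_gin');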

------
Regards,
Alexander Korotkov



On Thu, Jun 24, 2021 at 3:32 PM Pawel Kudzia <kudzia@gmail.com> wrote:
> > Pawel, is this a production system, or would it be possible for you to
> > build from sources with a patch with some extra debugging output?
> thank you for looking into it!
>
> it is a production system but we have couple of replicas. when the
> problem occurs - it can be reproduced on any replica as well. when
> it happens again - i can stop one of replicas, take file level backup
> and set up a test machine on which i can try to build PG from source,
> including your patch adding extra verbosity.
>
> please note that the last inconsistency was fixed, but i expect that
> new one will show up within few days.


Hello again,

I have a consistent file-level backup of postgresql's /var/lib/postgresql +
/etc/postgresql on which I can reproduce the issue reliably. It's on a test
machine where we can put a patched version of PG. Currently this machine
is using Debian 13.3-1.pgdg100+1.

set enable_seqscan=off;
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' )
AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;

returns 2 rows while it should return none.

Also in this case, lowering work_mem from 1MB to 256kB fixes the
issue - the SELECT returns 0 rows instead of 2.

I'll be happy to run a patched version and send you back the logs it produces.

greetings!

-- 
regards,
Pawel Kudzia



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Alexander Korotkov
Date:
On Tue, Jun 22, 2021 at 7:02 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Fri, Jun 18, 2021 at 12:55 AM Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> > On Thu, Jun 17, 2021 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Anyway, I'm punting this to the code authors.  I'd like to see
> > > some comments clarifying what the API really is, not just a
> > > quick-n-dirty code fix.
> >
> > Yep, I'm going to have a closer look at this tomorrow.
>
> Sorry for the delay.  I'm going to take a closer look in the next
> couple of days.

I've had a closer look at the shimTriConsistentFn() function.  It looks to me
like the function itself has inconsistencies.  But the way it's
currently used in GIN shouldn't produce wrong query answers.

shimTriConsistentFn() is one implementation of
GinScanKey.triConsistentFn.  In turn, GinScanKey.triConsistentFn has
3 callers:
1) startScanKey() to determine required keys
2) keyGetItem() for lossy item check
3) keyGetItem() for normal item check

Let's look at shimTriConsistentFn()'s inconsistencies and how the callers handle them.
1) shimTriConsistentFn() returns the result of directBoolConsistentFn()
for zero maybe entries without examining key->recheckCurItem.  That
may result in returning GIN_TRUE instead of GIN_MAYBE.
1.1) startScanKey() doesn't care about recheck, it's just looking for
a GIN_FALSE result.
1.2) keyGetItem() for a lossy item always implies recheck.
1.3) keyGetItem() for a normal item does the trick.  Whether a current
item is rechecked is controlled by the key->recheckCurItem variable (the
same one that is set by directBoolConsistentFn()).  The following switch
sets key->recheckCurItem for GIN_MAYBE, but leaves it untouched for
GIN_TRUE.  Thus, GIN_TRUE with key->recheckCurItem set works here just
like GIN_MAYBE.  I think this is inconsistent by itself, but this
inconsistency compensates for the inconsistency in shimTriConsistentFn().
2) shimTriConsistentFn() might call directBoolConsistentFn() with all
false inputs for GIN_SEARCH_MODE_DEFAULT.  The result is intended to
be false, but the opclass consistent method isn't intended to handle this
situation correctly.  For instance, ginint4_consistent() returns true
without checking the input array.  That could make
shimTriConsistentFn() return GIN_TRUE instead of GIN_MAYBE, or
GIN_MAYBE instead of GIN_FALSE.
2.1) In principle, this could lead startScanKey() to select fewer
required entries than possible.  But this is definitely not the case
for ginint4_consistent(), where meeting any of the entries is enough for
a match.
2.2) In principle, keyGetItem() could get false positives for lossy
items.  But that wouldn't lead to wrong query answers.  Again, this is
not the case for ginint4_consistent().
2.3) keyGetItem() for a normal item doesn't call triConsistentFn()
with any GIN_MAYBE or all GIN_FALSE.

To sum up, I don't see how the inconsistencies in shimTriConsistentFn()
could lead to wrong query answers, especially with an intarray GIN index.
Nevertheless, these inconsistencies should be fixed.  I'm going to
propose a patch soon.

------
Regards,
Alexander Korotkov



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Heikki Linnakangas
Date:
On 11/07/2021 00:56, Alexander Korotkov wrote:
> I've closer look at shimTriConsistentFn() function.  It looks to me
> like the function itself has inconsistencies.  But the way it's
> currently used in GIN shouldn't produce the wrong query answers.
> [..]
>
> To sum up, I don't see how inconsistencies in shimTriConsistentFn()
> could lead to wrong query answers, especially in intarray GIN index.

Agreed, I came to the same conclusion looking at the code. Which means 
that we still don't have an explanation for the original bug report :-(.

> Nevertheless, these inconsistencies should be fixed.  I'm going to
> propose a patch soon.

Thanks! I came up with the attached patch. I changed
directBoolConsistentFn() to return a GinTernaryValue rather than bool; I
think that makes the logic in shimTriConsistentFn() clearer.

I also tried to write a test case to expose issue 2.1 above, but 
couldn't come up with an example.

- Heikki

Attachments

Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Heikki Linnakangas
Date:
On 05/07/2021 16:36, Pawel Kudzia wrote:
> i have a consistent file-level backup of postgresql's /var/lib/postgresql +
> /etc/postgresql on which i can reproduce the issue reliably. it's on a test
> machine where we can put patched version of PG. currently this machine
> is using Debian 13.3-1.pgdg100+1.
> 
> set enable_seqscan=off;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' )
> AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;
> 
> returns me 2 rows while it should return none.
> 
> also in this case lowering work_mem from 1MB to 256kB makes fixes the
> issue - SELECT returns 0 rows instead of 2.
> 
> i'll be happy to run patched version and send you back logs produced by it.

Thanks! Here's a patch that prints lines with "GINBUG:" prefix in the 
log. It should apply cleanly to PostgreSQL v13.

This should help to confirm whether the bugs in shimTriConsistentFn that 
Tom pointed out are the root cause of this issue. It should also tell us 
whether the rows are being returned from the pending-inserts list or the 
regular part of the GIN index.

We might need a few iterations to test different theories, but we'll get 
there...

- Heikki

Ps. Sorry for the delay, I didn't see your reply until now, it went into
the spam folder.

Attachments
On Sun, Jul 11, 2021 at 8:42 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> [..]
>
> Thanks! Here's a patch that prints lines with "GINBUG:" prefix in the
> log. It should apply cleanly to PostgreSQL v13.
>
> This should help to confirm whether the bugs in shimTriConsistentFn that
> Tom pointed out are the root cause of this issue. It should also tell us
> whether the rows are being returned from the pending-inserts list or the
> regular part of the GIN index.
>
> We might need a few iterations to test different theories, but we'll get
> there...
>

Heikki - thank you for the patch! It applied cleanly.

I've run these two commands:

SET enable_seqscan=off;
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' )
AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;

and got this in the output:

2021-07-12 07:41:05 UTC LOG:  GINBUG: startScanKey called: excludeOnly
0 nentries 1

2021-07-12 07:41:05 UTC STATEMENT:  SELECT entity_id FROM entity WHERE
( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
&& '{1737}') LIMIT 10;
2021-07-12 07:41:05 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0


the "STATEMENT" line is repeated 79586 times, the "GINBUG: called
triConsistentFn" - 79585 times.

there's nothing else in the log besides information about server startup.

i'll be happy to help with the follow up checks.


> Ps. Sorry for the delay, I didn't see you reply until now, it went into
> spam folder

no worries!


-- 
regards,
Pawel Kudzia



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Heikki Linnakangas
Date:
On 12/07/2021 20:24, Pawel Kudzia wrote:
> i've run these two commands:
> 
> SET enable_seqscan=off;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1737}' )
> AND NOT ( (attribute_name_ids||0) && '{1737}') LIMIT 10;
> 
> and got this in the output:
> 
> 2021-07-12 07:41:05 UTC LOG:  GINBUG: startScanKey called: excludeOnly
> 0 nentries 1
> 
> 2021-07-12 07:41:05 UTC STATEMENT:  SELECT entity_id FROM entity WHERE
> ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
> && '{1737}') LIMIT 10;
> 2021-07-12 07:41:05 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
> 
> 
> the "STATEMENT" line is repeated 79586 times, the "GINBUG: called
> triConsistentFn" - 79585 times.
> 
> there's nothing else in the log besides information about server startup.
> 
> i'll be happy to help with the follow up checks.

Ok, that confirms that it's taking the simple path through 
triConsistentFn(), with a single non-lossy key. Also, there don't seem 
to be any pending inserts. Now that we've ruled those out, I have no 
clue what the problem might be.

Is the data sensitive, or could you share it? I realize it's a
pretty large database, but if I had a copy I could poke around and look
at the index pages involved here, to see if they're corrupt and how. If
that's possible, you can contact me directly, off-list, to figure out the
logistics of sending over a copy of the data directory.

- Heikki



On Tue, Jul 13, 2021 at 3:25 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Is the data sensitive, or you could you share it? I realize it's a
> pretty large database, but if I had a copy I could poke around to look
> at the index pages involved here, to see if they're corrupt and how. If
> that's possible, you can contact me directly, off-list, to figure the
> logistics to send over a copy of the data directory.

Worth noting that the Wiki describes two ways to do this when
reporting a bug -- using gdb, and using contrib/pageinspect:


https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB

-- 
Peter Geoghegan



On Wed, Jul 14, 2021 at 1:03 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Jul 13, 2021 at 3:25 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > Is the data sensitive, or you could you share it? I realize it's a
> > pretty large database, but if I had a copy I could poke around to look
> > at the index pages involved here, to see if they're corrupt and how. If
> > that's possible, you can contact me directly, off-list, to figure the
> > logistics to send over a copy of the data directory.
>
> Worth noting that the Wiki describes a number two ways to do this when
> reporting a bug -- using gdb, and using contrib/pageinspect:
>
>
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB
>


Heikki - the data is semi-sensitive. I cannot share the whole table.
I could if it was possible to remove some columns or isolate only a
handful of rows.

But - just doing "vacuum full entity" resolves the original issue
and the SELECT returns 0 rows, as it should.

How can I identify the relevant pages to make a selective dump of them
for you, using the method proposed by Peter?

Would an interactive session, where you get temporary access to a
server with a copy of the database to run gdb on it, be of any help?

Greetings!

-- 
regards,
Pawel Kudzia



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From: Heikki Linnakangas
Date:
On 14/07/2021 17:37, Pawel Kudzia wrote:
> Heikki - the data is semi-sensitive. I cannot share the whole table.
> I could if it was possible to remove some columns or isolate only
> handful of rows.
> 
> But - just doing the "vacuum full entity" resolves the original issue
> and select returns 0 rows, as it should.

Ok. One idea would be to write a program or extension that overwrites 
all the uninteresting columns with zeros.

> How can I identify relevant pages to make selective dump of them for
> you, using method proposed by Peter?

It's a bit hard without knowing which index pages are affected. But the 
attached script (dump-gin-page-info.sql) dumps some summary information 
about the kind of pages there are. That's a start.
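
The attachment itself isn't included here, but a summary of that kind can be
produced with pageinspect along these lines (a sketch, not necessarily the
attached script; it counts the index pages by the flags in their opaque data):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- block 0 is the metapage; gin_metapage_info(get_raw_page(..., 0)) describes it
SELECT (gin_page_opaque_info(get_raw_page('entity_attribute_name_ids_gin',
                                          blkno::int))).flags AS flags,
       count(*) AS pages
FROM generate_series(1,
       pg_relation_size('entity_attribute_name_ids_gin') / 8192 - 1) AS blkno
GROUP BY flags
ORDER BY pages DESC;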

> Would interactive session where you can get temporary access to the
> server with copy of the database,
> to run gdb on it be of any help?

Let's try one thing first: I rebased a patch to add GIN support to the 
amcheck extension [1]. I kind of doubt that it will catch the corruption 
in this case, but you never know. So please apply the attached 
v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I 
posted at [1], but the installation script is slightly different to make 
it work on PostgreSQL v13. Then compile and install it:

   cd contrib/amcheck; make install

Then in psql:

   CREATE EXTENSION amcheck;
   SELECT gin_index_parent_check('entity_attribute_name_ids_gin');

If it reports any inconsistencies, we can try to use pageinspect to 
gather further evidence on them.

[1] 
https://www.postgresql.org/message-id/a5bc2aad-464b-91bd-061d-28af0f9b634c%40iki.fi
- Heikki

Attachments
> > How can I identify relevant pages to make selective dump of them for
> > you, using method proposed by Peter?
>
> It's a bit hard without knowing which index pages are affected. But the
> attached script (dump-gin-page-info.sql) dumps some summary information
> about the kind of pages there are. That's a start.

I'm attaching output of the proposed commands.


> Let's try one thing first: I rebased a patch to add GIN support to the
> amcheck extension [1]. I kind of doubt that it will catch the corruption
> in this case, but you never know. So please apply the attached
> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I
> posted at [1], but the installation script is slightly different to make
> it work on PostgreSQL v13. Then compile install it:
>
>    cd contrib/amcheck; make install
>
> Then in psql:
>
>    CREATE EXTENSION amcheck;
>    gin_index_parent_check('entity_attribute_name_ids_gin');

Sadly i'm getting this when executing CREATE EXTENSION amcheck;

ERROR:  extension "amcheck" has no installation script nor update path
for version "1.2.1"


Thank you!

-- 
regards,
Pawel Kudzia

Attachments

Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Alexander Korotkov
Date:
On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> Do you think it also worth checking whether bug persists when set
> fastupdate = off.  Then we could localize whether bug is related to
> pending lists.

I still think this is worth checking.  Although the pending list wasn't
involved in the index scan with the wrong results, the bug could be
related to insertion into the pending list.  Or it could be related to
moving entries from the pending list to the posting list/tree.
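
For reference, fastupdate is a per-index storage parameter, so trying
this would look roughly like the sketch below (index name taken from
this thread; gin_clean_pending_list() flushes whatever is already queued):

   -- stop routing new entries through the pending list
   ALTER INDEX entity_attribute_name_ids_gin SET (fastupdate = off);
   -- move anything already in the pending list into the main structure
   SELECT gin_clean_pending_list('entity_attribute_name_ids_gin'::regclass);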

------
Regards,
Alexander Korotkov



> On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> > Do you think it also worth checking whether bug persists when set
> > fastupdate = off.  Then we could localize whether bug is related to
> > pending lists.
>
> I still think this is worth checking.  Despite the pending list wasn't
> involved in the index scan with wrong results, the bug could be
> related to insertion to the pending list.  Or it could be related to
> moving entries from the pending list to the posting list/tree.
>

regarding fastupdate - i'd like to clarify: do i understand correctly
that this parameter affects what happens when rows are
inserted/updated/deleted?
if yes - we no longer have such a workload in production, so we cannot
reproduce it anymore, and i was never able to find a reproducible
scenario. the production environment was moved away from the index
built "USING gin (attribute_name_ids public.gin__int_ops)" to an index
built "USING gin (attribute_name_ids)".

the only thing i have right now is two file-level backups of the
postgresql data directory on which i know that SELECTs return rows
that actually don't meet the provided criteria.

is there any point in testing fastupdate = off on those two snapshots?


-- 
regards,
Pawel Kudzia



Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Alexander Korotkov
Date:
On Thu, Jul 15, 2021 at 4:27 PM Pawel Kudzia <kudzia@gmail.com> wrote:
> > On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov
> > <aekorotkov@gmail.com> wrote:
> > > Do you think it also worth checking whether bug persists when set
> > > fastupdate = off.  Then we could localize whether bug is related to
> > > pending lists.
> >
> > I still think this is worth checking.  Despite the pending list wasn't
> > involved in the index scan with wrong results, the bug could be
> > related to insertion to the pending list.  Or it could be related to
> > moving entries from the pending list to the posting list/tree.
> >
>
> regarding fastupdate - i'd like to clarify. do i understand correctly
> that this parameter affects what happens when rows are
> inserted/updated/deleted?

Yes, that's it.

> if yes - we no longer have such workload on the production and we
> cannot anymore and i was never able to find reproducible scenario.
> the production environment was moved away from index built "USING gin
> (attribute_name_ids public.gin__int_ops)" to index "USING gin
> (attribute_name_ids)".

Do I understand correctly that after the production environment moved
away from the usage of contrib/intarray opclass to builtin opclass,
the error has gone?

> the only thing i have right two file-level backups of postgresql data
> directory on which i know that SELECTs return rows that actually don't
> meet provided criteria.
>
> is there any point in testing fastupdate = off on those two snapshots?

OK, I see.  No point in trying fastupdate = off unless we can try to
reproduce the index corruption.

------
Regards,
Alexander Korotkov




On Thu, Jul 15, 2021 at 4:27 PM Pawel Kudzia <kudzia@gmail.com> wrote:
> > On Fri, Jun 25, 2021 at 11:48 PM Alexander Korotkov
> > <aekorotkov@gmail.com> wrote:
> > > Do you think it also worth checking whether bug persists when set
> > > fastupdate = off.  Then we could localize whether bug is related to
> > > pending lists.
> >
> > I still think this is worth checking.  Despite the pending list wasn't
> > involved in the index scan with wrong results, the bug could be
> > related to insertion to the pending list.  Or it could be related to
> > moving entries from the pending list to the posting list/tree.
> >
>
> regarding fastupdate - i'd like to clarify. do i understand correctly
> that this parameter affects what happens when rows are
> inserted/updated/deleted?

Yes, that's it.

> if yes - we no longer have such workload on the production and we
> cannot anymore and i was never able to find reproducible scenario.
> the production environment was moved away from index built "USING gin
> (attribute_name_ids public.gin__int_ops)" to index "USING gin
> (attribute_name_ids)".

Do I understand correctly that after the production environment moved
away from the usage of contrib/intarray opclass to builtin opclass,
the error has gone?

it's a bit too early to be confident - i'd give it at least 2-3 more weeks before stating that the problem was solved.

in the past, when using gin__int_ops, we had periods of a few consecutive days when we did not catch any inconsistent responses.

 
> the only thing i have right two file-level backups of postgresql data
> directory on which i know that SELECTs return rows that actually don't
> meet provided criteria.
>
> is there any point in testing fastupdate = off on those two snapshots?

OK, I see.  No point in trying fastupdate = off unless we can try to
reproduce the index corruption.

thx for the clarification! 

i'm happy to help with running more diagnostics on the database dumps that i've saved, where very specific SELECTs return rows that don't match the provided criteria.
 
greetings!

--
regards,
Pawel Kudzia

Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 15/07/2021 12:38, Pawel Kudzia wrote:
>>> How can I identify relevant pages to make selective dump of them for
>>> you, using method proposed by Peter?
>>
>> It's a bit hard without knowing which index pages are affected. But the
>> attached script (dump-gin-page-info.sql) dumps some summary information
>> about the kind of pages there are. That's a start.
> 
> I'm attaching output of the proposed commands.

Thanks!

>> Let's try one thing first: I rebased a patch to add GIN support to the
>> amcheck extension [1]. I kind of doubt that it will catch the corruption
>> in this case, but you never know. So please apply the attached
>> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I
>> posted at [1], but the installation script is slightly different to make
>> it work on PostgreSQL v13. Then compile install it:
>>
>>     cd contrib/amcheck; make install
>>
>> Then in psql:
>>
>>     CREATE EXTENSION amcheck;
>>     gin_index_parent_check('entity_attribute_name_ids_gin');
> 
> Sadly i'm getting this when executing CREATE EXTENSION amcheck;
> 
> ERROR:  extension "amcheck" has no installation script nor update path
> for version "1.2.1"

Hmm, that's odd, it worked for me. Did "make install" copy the 
"amcheck--1.2--1.2.1.sql" file to the right place? It should be 
installed the same directory as "amcheck.control".

Anyway, let's try a different tack. Here's another debugging patch. It 
causes a LOG message to be printed whenever a GIN index page is 
accessed, like this:

2021-07-15 21:22:20.119 EEST [972708] LOG:  ReadBuffer 1663/12678/27035 
read gin blk 0 (ginget.c:1832 scanPendingInsert)
2021-07-15 21:22:20.119 EEST [972708] STATEMENT:  explain analyze select 
* from test_intarray where a @> '{2}';
2021-07-15 21:22:20.119 EEST [972708] LOG:  ReadBuffer 1663/12678/27035 
read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
2021-07-15 21:22:20.119 EEST [972708] STATEMENT:  explain analyze select 
* from test_intarray where a @> '{2}';

Please apply this patch, run the query again, and copy the resulting 
log. I'm also interested in the contents of the pages that are accessed, 
so if you could then run these commands to dump the contents of those 
index pages:

# extract just the block numbers from the log
cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print 
"$1\n" ' | sort -n |uniq > /tmp/blocknums

# dump contents of each of those index pages to '/tmp/block-contents'
psql postgres -c "create temp table blocknums(blknum int); copy 
blocknums from '/tmp/blocknums'; copy (select blknum, 
get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to 
'/tmp/block-contents';"

Please send over the resulting PostgreSQL log, and 
'/tmp/block-contents', and I'll take a closer look into the contents to 
see if there's something funny. (A private email is OK if you don't feel 
comfortable sharing it with the world).

One more thing, please also add the 'ctid' virtual column to the query, 
like this:

SELECT ctid, entity_id FROM entity WHERE
( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
&& '{1737}') LIMIT 10;

Also, did you happen to keep the WAL log from around the time that the 
query started producing wrong results? That would also be useful 
evidence for how it got into this situation.

Thanks in advance! We might need a few more round-trips, but we'll get 
there.

- Heikki

Attachments


On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 15/07/2021 12:38, Pawel Kudzia wrote:
>>> How can I identify relevant pages to make selective dump of them for
>>> you, using method proposed by Peter?
>>
>> It's a bit hard without knowing which index pages are affected. But the
>> attached script (dump-gin-page-info.sql) dumps some summary information
>> about the kind of pages there are. That's a start.
>
> I'm attaching output of the proposed commands.

Thanks!

>> Let's try one thing first: I rebased a patch to add GIN support to the
>> amcheck extension [1]. I kind of doubt that it will catch the corruption
>> in this case, but you never know. So please apply the attached
>> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I
>> posted at [1], but the installation script is slightly different to make
>> it work on PostgreSQL v13. Then compile install it:
>>
>>     cd contrib/amcheck; make install
>>
>> Then in psql:
>>
>>     CREATE EXTENSION amcheck;
>>     gin_index_parent_check('entity_attribute_name_ids_gin');
>
> Sadly i'm getting this when executing CREATE EXTENSION amcheck;
>
> ERROR:  extension "amcheck" has no installation script nor update path
> for version "1.2.1"

Hmm, that's odd, it worked for me. Did "make install" copy the
"amcheck--1.2--1.2.1.sql" file to the right place? It should be
installed the same directory as "amcheck.control".



thank you for the hint. i've messed up patching. now i can get one step further:

data=# CREATE EXTENSION amcheck;
CREATE EXTENSION
data=# gin_index_parent_check('entity_attribute_name_ids_gin');
ERROR:  syntax error at or near "gin_index_parent_check"
LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin');
        ^

i've also applied trace-gin-readbuffer-2.patch

 
Anyway, let's try a different tack. Here's another debugging patch. It
causes a LOG message to be printed whenever a GIN index page is
accessed, like this:

2021-07-15 21:22:20.119 EEST [972708] LOG:  ReadBuffer 1663/12678/27035
read gin blk 0 (ginget.c:1832 scanPendingInsert)
2021-07-15 21:22:20.119 EEST [972708] STATEMENT:  explain analyze select
* from test_intarray where a @> '{2}';
2021-07-15 21:22:20.119 EEST [972708] LOG:  ReadBuffer 1663/12678/27035
read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
2021-07-15 21:22:20.119 EEST [972708] STATEMENT:  explain analyze select
* from test_intarray where a @> '{2}';

Please apply this patch, run the query again, and copy the resulting
log.

that'll be the file called "log", sent directly to your e-mail address.
 
I'm also interested in the contents of the pages that are accessed,
so if you could then run these commands to dump the contents of those
index pages:

# extract just the block numbers from the log
cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print
"$1\n" ' | sort -n |uniq > /tmp/blocknums

# dump contents of each of those index pages to '/tmp/block-contents'
psql postgres -c "create temp table blocknums(blknum int); copy
blocknums from '/tmp/blocknums'; copy (select blknum,
get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to
'/tmp/block-contents';"

done. side note - i had to run "create extension pageinspect" and provide the name of my database - "data" rather than "postgres".


Please send over the resulting PostgreSQL log, and
'/tmp/block-contents', and I'll take a closer look into the contents to
see if there's something funny. (A private email is OK if you don't feel
comfortable sharing it with the world).

One more thing, please also add the 'ctid' virtual column to the query,
like this:

SELECT ctid, entity_id FROM entity WHERE
( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
&& '{1737}') LIMIT 10;

that's how the queries above were run. output:

     ctid     | entity_id
--------------+-----------
 (4002784,1)  |  38048120
 (4002869,14) |  95333744
(2 rows)

 
Also, did you happen to keep the WAL log from around the time that the
query started producing wrong results? That would also be useful
evidence for how it got into this situation.

sadly i don't. as noted earlier - we've already switched the production setup, so i have no way of triggering the corruption again; i only have two file-level backups of the database with two different inconsistencies in the SELECTs.

greetings!

--
regards,
Pawel Kudzia
On Thu, Jul 15, 2021 at 3:56 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I still think this is worth checking.  Despite the pending list wasn't
> involved in the index scan with wrong results, the bug could be
> related to insertion to the pending list.  Or it could be related to
> moving entries from the pending list to the posting list/tree.

If I had to guess I'd say that the chances of the pending list being
involved are high.

shiftList() deletes pages in the pending list -- this is called from
ginInsertCleanup(). But shiftList() doesn't call GinPageSetDeleteXid()
to set an XID that represents when the page is safe to recycle, which
is what ginDeletePage() always does. Why is that okay?

Note that we usually use read/share buffer locks when lock-coupling
inside ginInsertCleanup() -- so AFAICT we won't block-out concurrent
readers with a link that's about to go stale due to recycling of the
page. This looks unsafe. Of course it's very hard to tell what might
be going on, since none of this seems to be explained anywhere.

Even ginDeletePage() didn't do the right thing with XIDs until bugfix
commit 52ac6cd2d0. That commit didn't touch any pending list code --
it should have at least explained why ginInsertCleanup()/shiftList()
don't need to use the GinPageSetDeleteXid() stuff.

--
Peter Geoghegan



Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 17/07/2021 02:40, Peter Geoghegan wrote:
> On Thu, Jul 15, 2021 at 3:56 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I still think this is worth checking.  Despite the pending list wasn't
>> involved in the index scan with wrong results, the bug could be
>> related to insertion to the pending list.  Or it could be related to
>> moving entries from the pending list to the posting list/tree.
> 
> If I had to guess I'd say that the chances of the pending list being
> involved are high.
> 
> shiftList() deletes pages in the pending list -- this is called from
> ginInsertCleanup(). But shiftList() doesn't call GinPageSetDeleteXid()
> to set an XID that represents when the page is safe to recycle, which
> is what ginDeletePage() always does. Why is that okay?
> 
> Note that we usually use read/share buffer locks when lock-coupling
> inside ginInsertCleanup() -- so AFAICT we won't block-out concurrent
> readers with a link that's about to go stale due to recycling of the
> page. This looks unsafe. Of course it's very hard to tell what might
> be going on, since none of this seems to be explained anywhere.
> 
> Even ginDeletePage() didn't do the right thing with XIDs until bugfix
> commit 52ac6cd2d0. That commit didn't touch any pending list code --
> it should of at least explained why ginInsertCleanup()/shiftList()
> don't need to use the GinPageSetDeleteXid() stuff.

Hmm, seems we should fix that. But could a prematurely recycled deleted 
page cause permanent corruption?

Here's what I've found out so far, looking at the traces Pawel sent:

> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 0 (ginutil.c:636 ginGetStats)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 0 (ginget.c:1866 scanPendingInsert)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 383295 (ginget.c:1894 scanPendingInsert)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 383298 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 2 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 3 (ginget.c:1518 scanGetCandidate)
> [few hundred calls from scanGetCandidate omitted]
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 200586 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 200854 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 1452 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: startScanKey called: excludeOnly 0 nentries 1
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
> ...

So, it begins by reading the metadata page, and then it scans the 
pending list. There are a few hundred pages in the pending list, but no 
matches are found there.
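
(As an aside, the pending-list state is visible in the metapage; a
sketch with pageinspect, not part of the trace itself:

   SELECT pending_head, pending_tail, n_pending_pages, n_pending_tuples
   FROM gin_metapage_info(get_raw_page('entity_attribute_name_ids_gin', 0));

That shows where the pending list starts and ends and how many pages
and tuples it currently holds.)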

Then, it reads the root page at blk 1. And then, it reads page 1452.

I used pg_filedump on the root page (blk 1):

> Block    1 ********************************************************
> <Header> -----
>  Block Offset: 0x00002000         Offsets: Lower    1376 (0x0560)
>  Block: Size 8192  Version    4            Upper    2776 (0x0ad8)
>  LSN:  logid    410 recoff 0x04c6b6c0      Special  8184 (0x1ff8)
>  Items:  338                      Free Space: 1400
>  Checksum: 0xde29  Prune XID: 0x00000000  Flags: 0x0000 ()
>  Length (including item array): 1376
> 
> <Data> -----
>  Item   1 -- Length:   16  Offset: 7736 (0x1e38)  Flags: NORMAL
>  Item   2 -- Length:   16  Offset: 8168 (0x1fe8)  Flags: NORMAL
>  Item   3 -- Length:   16  Offset: 7800 (0x1e78)  Flags: NORMAL
>  ...
>  Item 337 -- Length:   16  Offset: 7032 (0x1b78)  Flags: NORMAL
>  Item 338 -- Length:   16  Offset: 8152 (0x1fd8)  Flags: NORMAL
> 
> <Special Section> -----
>  GIN Index Section:
>   Flags: 0x00000000 ()  Maxoff: 0
>   Blocks: RightLink (-1)

It has no Flags set, which means that it is an internal entry tree page. 
The items on an internal entry tree page should point to other entry 
tree pages. Now let's take a look at the next page that it read, blk 1452:

> Block 1452 ********************************************************
> <Header> -----
>  Block Offset: 0x00b58000         Offsets: Lower     562 (0x0232)
>  Block: Size 8192  Version    4            Upper    8184 (0x1ff8)
>  LSN:  logid    471 recoff 0xf0cf8a68      Special  8184 (0x1ff8)
>  Items:  134                      Free Space: 7622
>  Checksum: 0x1628  Prune XID: 0x00000000  Flags: 0x0000 ()
>  Length (including item array): 560
> 
> <Data> -----
>  Item   1 -- Length:    0  Offset:    0 (0x0000)  Flags: UNUSED
>  Item   2 -- Length:    0  Offset:    0 (0x0000)  Flags: UNUSED
>  Item   3 -- Length: 7780  Offset:    0 (0x0000)  Flags: UNUSED
>  Item   4 -- Length: 5990  Offset:    4 (0x0004)  Flags: UNUSED
>  Item   5 -- Length:    0  Offset:    6 (0x0006)  Flags: UNUSED
>  Item   6 -- Length:    2  Offset: 15559 (0x3cc7)  Flags: UNUSED
>   Error: Item contents extend beyond block.
>          BlockSize<8192> Bytes Read<8192> Item Start<15561>.
>  Item   7 -- Length:    3  Offset: 15372 (0x3c0c)  Flags: REDIRECT
>  ...
>  Item 133 -- Length: 20913  Offset:    3 (0x0003)  Flags: UNUSED
>   Error: Item contents extend beyond block.
>          BlockSize<8192> Bytes Read<8192> Item Start<20916>.
>  Item 134 -- Length:    0  Offset:    0 (0x0000)  Flags: UNUSED
> 
> <Special Section> -----
>  GIN Index Section:
>   Flags: 0x00000001 (DATA)  Maxoff: 53
>   Blocks: RightLink (-1)

On this page, the DATA flag is set, so it is an internal *posting* tree 
page.

That's weird: the scan walked straight from an internal entry tree page 
(root, at blk 1) into an internal posting tree page (blk 1452). That 
doesn't make sense to me.

The next ReadBuffer call is this:

> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 15559 (ginbtree.c:183 ginStepRight)

Where did block 15559 come from? How come we're stepping right to it? 
It's not the right sibling of the previously accessed page, 1452. In 
fact, 15559 is a leaf posting tree page. I don't understand how that 
sequence of page reads could happen.

After that, the log shows that it follows the right-links from block 
15559 onward. The incorrectly returned entries are on posting tree page 
255179. The scan reaches that page by following that chain of right-links.

I'm going to continue investigating this on Monday. I'll take a closer 
look at the index entries on those blocks to see if I can make sense of it.

- Heikki



On Fri, Jul 16, 2021 at 5:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, seems we should fix that. But could a prematurely recycled deleted
> page cause permanent corruption?

If scans can find a page that is wholly unrelated to the expected page
(and possibly even in the wrong high level page category), then it's
really hard to predict what might happen. This could lead to real
chaos. ginInsertCleanup() makes no attempt to perform basic validation
of its assumptions about what kind of page this is, except for some
assertions. We should have something like a "can't happen" error on
!GinPageIsList() inside ginInsertCleanup() -- if we had that already
then I might be able to reason about this problem. It wouldn't hurt to
have similar checks in other code that deals with posting trees and
entry trees, too.

ginInsertCleanup() is tolerant of all kinds of things. It's not just
the lack of page-level sanity checks. It's also the basic approach to
crash safety, which relies on the fact that GIN only does lossy index
scans. My guess is that there could be lots of problems without it
being obvious to users. Things really went downhill in
ginInsertCleanup() starting in commit e956808328.

> On this page, the DATA flag is set, so it is an internal *posting* tree
> page.
>
> That's weird: the scan walked straight from an internal entry tree page
> (root, at blk 1) into an internal posting tree page (blk 1452). That
> doesn't make sense to me.

I agree that the internal entry tree page (root, at blk 1) looks sane,
from what I've seen. The tuple sizes are plausible -- 16 byte index
tuples aren't possible on an entry tree leaf page. Nor in a pending
list page.

Anyway, this is roughly the kind of bug I had in mind. It's possible
that the underlying problem doesn't actually involve
ginInsertCleanup() -- as I said we have seen similar issues elsewhere
(one such issue was fixed in commit 52ac6cd2d0). But as Alexander
pointed out, that doesn't mean much. It's possible that this problem
is 1 or 2 problems removed from the original problem, which really did
start in ginInsertCleanup() -- who knows? Why shouldn't corruption
lead to more corruption, given that we don't do much basic page level
validation? We do at least sanitize within ginStepRight(), but we need
to be more consistent about it.

> The next ReadBuffer call is this:
>
> > 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 15559 (ginbtree.c:183 ginStepRight)
>
> Where did block 15559 come from? How come we're stepping right to it?
> It's not the right sibling of the previously accessed page, 1452. In
> fact, 15559 is a leaf posting tree page. I don't understand how that
> sequence of page reads could happen.

Maybe take a look at Block 1452 using pg_hexedit? pg_hexedit is
designed to do well at interpreting quasi-corrupt data (or at least
allowing the user to do so). We see from your pg_filedump output that
the tuple contents for the page are totally wild. We should not trust
the reported right sibling page, given everything else -- is that
really what Postgres thinks the right sibling is? I mean, clearly it
doesn't.

I think it's possible that pg_filedump is interpreting it in a way
that is kind of wrong. If you saw the same page (1452) in pg_hexedit
you might spot a pattern that pg_filedump output will never reveal. At
least looking at the raw bytes might give you some idea.

-- 
Peter Geoghegan



Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 16/07/2021 10:06, Pawel Kudzia wrote:
> 
> 
> On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> 
> thank you for the hint. i've messed up patching. now i can get one step 
> further:
> 
> data=# CREATE EXTENSION amcheck;
> CREATE EXTENSION
> data=# gin_index_parent_check('entity_attribute_name_ids_gin');
> ERROR:  syntax error at or near "gin_index_parent_check"
> LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin');
>          ^

Ah, sorry, you need to call it with SELECT, like:

SELECT gin_index_parent_check('entity_attribute_name_ids_gin');

> i've also applied trace-gin-readbuffer-2.patch

Darn, I missed one function that's used to read a page when descending 
the GIN tree. That explains the seemingly nonsensical accesses in the 
log - the trace left out some crucial accesses.

Attached is a new version of that debugging patch. Please repeat the 
same steps as before with this:

1. Apply the patch (remove previous patch first if necessary)

2. Compile Postgres, "make install"

3. Clear the log, start postgres

4. Run the query again:

SELECT ctid, entity_id FROM entity WHERE
( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
&& '{1737}') LIMIT 10;

5. Stop the server.

6. Extract the content of the accessed index blocks:

cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print 
"$1\n" ' | sort -n |uniq > /tmp/blocknums

psql data -c "create temp table blocknums(blknum int); copy blocknums 
from '/tmp/blocknums'; copy (select blknum, 
get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to 
'/tmp/block-contents';"

7. Send over /tmp/blocknums, /tmp/block-contents and the postgres log again.

Thank you for your patience!

- Heikki

Attachments

Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 17/07/2021 04:56, Peter Geoghegan wrote:
> On Fri, Jul 16, 2021 at 5:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm, seems we should fix that. But could a prematurely recycled deleted
>> page cause permanent corruption?
> 
> If scans can find a page that is wholly unrelated to the expected page
> (and possibly even in the wrong high level page category), then it's
> really hard to predict what might happen. This could lead to real
> chaos.

Right. Looking at the code, I don't think that can happen. The code that 
deals with the pending list uses lock coupling, so it should never 
follow a pointer to a page that was concurrently deleted. Might 
something funny happen if a pending list page is deleted and immediately 
reused as another kind of page? I don't see a problem with that. And a 
gin btree page won't be reused prematurely as a pending list page, 
because ginNewBuffer() checks GinPageIsRecyclable(), even if the page is 
being reused as a pending list page.

> ginInsertCleanup() makes no attempt to perform basic validation
> of its assumptions about what kind of page this is, except for some
> assertions. We should have something like a "can't happen" error on
> !GinPageIsList() inside ginInsertCleanup() -- if we had that already
> then I might be able to reason about this problem. It wouldn't hurt to
> have similar checks in other code that deals with posting trees and
> entry trees, too.

+1

While playing with the trace Pawel sent, I loaded some of the index 
pages into a local instance so that I could step through the code with 
gdb. I filled the rest of the pages with zeros. I managed to get the gin 
btree descend code into an infinite loop with that. When it followed a 
pointer to an entry leaf page, but that entry leaf page was substituted 
with zeros, it interpreted the all-zeros page as a page with a 
right-link to block 0. It then stepped right to read block 0 - which is 
the metapage! - and incorrectly interpreted it as an internal page. 
Amazingly, it managed to perform the binary search on it as if it 
contained index tuples, and came up with another block number, which I 
had also replaced with all-zeros.

Long story short, we should sprinkle a lot more sanity checks to the 
functions in ginbtree.c. It should check that the page is of expected 
kind; no stepping from internal entry page to posting tree page. And 
seeing a pointer to block 0 at any step should be an error.

But this was just an artifact of the missing pages in my test. It doesn't 
explain the original problem.

> ginInsertCleanup() is tolerant of all kinds of things. It's not just
> the lack of page-level sanity checks. It's also the basic approach to
> crash safety, which relies on the fact that GIN only does lossy index
> scans. My guess is that there could be lots of problems without it
> being obvious to users. Things really went downhill in
> ginInsertCleanup() starting in commit e956808328.

Yeah, I wouldn't be surprised if this turns out to be a bug in pending 
list cleanup. But I haven't been able to come up with a concrete 
sequence of steps that would cause it.

- Heikki



On Sat, Jul 17, 2021 at 2:27 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Right. Looking at the code, I don't think that can happen. The code that
> deals with the pending list uses lock coupling, so it should never
> follow a pointer to a page that was concurrently deleted. Might
> something funny happen if a pending list page is deleted and immediately
> reused as another kind of page? I don't see a problem with that.

I don't immediately see a problem either, but that hardly means much
-- I can't honestly claim that I understand the code here.

Why does the code in ginInsertCleanup() use lock coupling? It's
probably true that if you don't couple pending list pages then the
tests would break in an obvious way, but that tells us nothing about
the design itself. It was probably self-evident what was going on in
earlier versions of the code, back when it could only run in VACUUM
anyway, when it didn't eagerly recycle pending list pages, etc. But
it's far from obvious now that we have all that stuff too.

Let's say processPendingPage() finds a page that is fundamentally of
the wrong kind (i.e. not a pending page). Why shouldn't the result be
something scary (some kind of data corruption), that doesn't come with
an error or hard crash? I can see no reason. Small differences in the
format of the tuples seem unlikely to be caught anywhere, at least
offhand.

Another possible consequence of the lack of sanity checking is that
whole *groups* of sibling pages (say from a posting tree or the entry
tree) that look like they're a group from the pending list (but are
actually from some tree) could get corrupted, one by one. This could
happen as a secondary effect of the initial corruption, which might
have only affected one page at first. Why shouldn't ginInsertCleanup()
do that? Almost all GIN pages use "GinPageGetOpaque(page)->rightlink"
in one way or another. If this is what happened then there really is
no telling what you'll end up with. You may have a very hard time
zeroing in on the truly relevant corruption.

> And a
> gin btree page won't be reused prematurely as a pending list page,
> because ginNewBuffer() checks GinPageIsRecyclable(), even if the page is
> being reused as a pending list page.

It doesn't matter that ginNewBuffer() checks GinPageIsRecyclable(),
since the page header's pd_prune_xid field will be 0 for a pending
list page (same as other index AMs, though not the same as other GIN
code). Note that GinPageIsRecyclable() returns "true" (safe to
recycle) when it sees any "pd_prune_xid = 0" page, by design. Also, if
ginInsertCleanup() ever tries to process/delete a posting tree or
entry tree page, then it's still not going to set pd_prune_xid in the
first place (because it believes that they're pending list pages).
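
(For what it's worth, pd_prune_xid is easy to eyeball with pageinspect;
a sketch, with the block number picked arbitrarily from the earlier trace:

   SELECT lsn, flags, prune_xid
   FROM page_header(get_raw_page('entity_attribute_name_ids_gin', 1452));

A deleted pending-list page would show prune_xid = 0 here, which, per
the reasoning above, is what lets it be recycled immediately.)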

> While playing with the trace Pawel sent, I loaded some of the index
> pages into a local instance so that I could step through the code with
> gdb. I filled the rest of the pages with zeros. I managed to get the gin
> btree descend code into an infinite loop with that. When it followed a
> pointer to an entry leaf page, but that entry leaf page was substituted
> with zeros, it interpreted the all-zeros page as a page with a
> right-link to block 0. It then stepped right to read block 0 - which is
> the metapage! - and incorrectly interpreted it as an internal page.
> Amazingly, it managed to perform the binary search on it as if it
> contained index tuples, and came up wth an another block number, which I
> had also replaced with all-zeros.

That doesn't actually surprise me. I have deliberately corrupted a lot
of test data in all kinds of ways. I have some idea how broken things
can be internally without breaking in a very obvious way.

> Long story short, we should sprinkle a lot more sanity checks to the
> functions in ginbtree.c. It should check that the page is of expected
> kind; no stepping from internal entry page to posting tree page. And
> seeing a pointer to block 0 at any step should be an error.

Definitely.

> Yeah, I wouldn't be surprised if this turns out to be a bug in pending
> list cleanup. But I haven't been able to come up with a concrete
> sequence of steps that would cause it.

I also wouldn't be surprised if it was some unrelated issue, despite
everything. As broken as I think the code in ginInsertCleanup() is, it
has managed to be broken without anybody noticing for quite a while on
a few occasions already. So unfortunately all of my observations about
ginInsertCleanup() may in the end not help you to find the true
underlying bug.

-- 
Peter Geoghegan



On Sat, Jul 17, 2021 at 4:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > And a
> > gin btree page won't be reused prematurely as a pending list page,
> > because ginNewBuffer() checks GinPageIsRecyclable(), even if the page is
> > being reused as a pending list page.
>
> It doesn't matter that ginNewBuffer() checks GinPageIsRecyclable(),
> since the page header's pd_prune_xid field will be 0 for a pending
> list page (same as other index AMs, though not the same as other GIN
> code). Note that GinPageIsRecyclable() returns "true" (safe to
> recycle) when it sees any "pd_prune_xid = 0" page, by design. Also, if
> ginInsertCleanup() ever tries to process/delete a posting tree or
> entry tree page, then it's still not going to set pd_prune_xid in the
> first place (because it believes that they're pending list pages).

To be clear, I'm not worried about premature reuse of some other page
as a pending list page -- that should be fine. I'm worried about
unsafe early reuse of pending list pages, where the new page image is
related to either a posting tree or the entry tree. I think that it's
possible that the whole locking protocol can't handle premature
recycling, because it wasn't really built with that in mind.
shiftList() will call RecordFreeIndexPage() for pages it deletes as of
commit e956808328 from 2015. I find the code hard to follow, but I
suspect that some place cannot handle it if a
marked-GIN_DELETED-by-shiftList() page gets concurrently recycled.
Distant code may need to still have the same right link (and even the
same page contents?) to handle everything.

For example, distant code like scanGetCandidate() from ginget.c
participates in buffer lock coupling of pending list pages, and so has
interactions with ginInsertCleanup() and its own lock coupling. But
scanGetCandidate() hasn't changed since 2009, and might well have been
missed by the numerous bug fixes that changed the pending list code
following commit e956808328 from 2015. Nearby code in
scanPendingInsert() also hasn't been changed since 2009. But
ginInsertCleanup() *has* changed since that time, by quite a bit.
Starting with making shiftList() call RecordFreeIndexPage() for pages
it deletes. But other stuff too.

I might be mistaken in certain details, but the fact that it's so hard
to keep all the details straight concerns me.

--
Peter Geoghegan



On Sat, Jul 17, 2021 at 10:51 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 16/07/2021 10:06, Pawel Kudzia wrote:
> >
> >
> > On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > thank you for the hint. i've messed up patching. now i can get one step
> > further:
> >
> > data=# CREATE EXTENSION amcheck;
> > CREATE EXTENSION
> > data=# gin_index_parent_check('entity_attribute_name_ids_gin');
> > ERROR:  syntax error at or near "gin_index_parent_check"
> > LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin');
> >          ^
>
> Ah, sorry, you need to call it with SELECT, like:
>
> SELECT gin_index_parent_check('entity_attribute_name_ids_gin');
>

That worked. It gave me gin_index_parent_check.txt, attached.

> > i've also applied trace-gin-readbuffer-2.patch
>
> Darn, I missed one function that's used to read a page when descending
> the GIN tree. That explains the seemingly nonsensical accesses in the
> log - the trace left out some crucial accesses.
>
> Attached is a new version of that debugging patch. Please repeat the
> same steps as before with this:
>
> 1. Apply the patch (remove previous patch first if necessary)
>
> 2. Compile Postgres, "make install"
>
> 3. Clear the log, start postgres
>
> 4. Run the query again:
>
> SELECT ctid, entity_id FROM entity WHERE
> ( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
> && '{1737}') LIMIT 10;
>

for clarity - i'm also running "set enable_seqscan=off" before that
select to ensure that GIN is used.

SELECT gave me

     ctid     | entity_id
--------------+-----------
 (4002784,1)  |  38048120
 (4002869,14) |  95333744
(2 rows)



> 5. Stop the server.
>
> 6. Extract the content of the accessed index blocks:
>
> cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print
> "$1\n" ' | sort -n |uniq > /tmp/blocknums
>
> psql data -c "create temp table blocknums(blknum int); copy blocknums
> from '/tmp/blocknums'; copy (select blknum,
> get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to
> '/tmp/block-contents';"
>
> 7. Send over /tmp/blocknums, /tmp/block-contents and the postgres log again.
>

I'm sending those over directly to your mail.

> Thank you for your patience!
>

Thanks a lot for the investigation!


-- 
regards,
Pawel Kudzia

Attachments

Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 18/07/2021 11:27, Pawel Kudzia wrote:
> I'm sending those over directly to your mail.

Thanks! I looked through those logs, and I'm now not seeing anything 
wrong with the pages involved. The query first scans the pending list, 
and doesn't find any matches there. It then descends the entry tree, 
finds pointer to a posting tree for id 1373, then scans the posting 
tree. For some reason, the TIDs of those offending rows, (4002784,1) and 
(4002869,14) are in the posting tree for key '1373', even though the 
heap tuples don't contain that key. Both of those TIDs are in the same 
TID list, on page 255179:

> postgres=# select * from (select blknum, ((gin_leafpage_items(content)).*) from ginblocks where blknum=255179) x
>            where tids::text like '%(4002869,14)%';
>
>  blknum |  first_tid   | nbytes |                                   tids
> --------+--------------+--------+-----------------------------------------------------------------------------
>  255179 | (4002720,16) |     87 | {"(4002720,16)","(4002721,3)","(4002723,5)","(4002739,11)","(4002743,14)",
>         |              |        |  "(4002744,2)","(4002784,1)","(4002784,10)","(4002790,46)","(4002790,47)",
>         |              |        |  "(4002792,22)","(4002792,23)","(4002792,24)","(4002796,22)","(4002798,2)",
>         |              |        |  "(4002800,7)","(4002802,27)","(4002802,28)","(4002803,22)","(4002805,28)",
>         |              |        |  "(4002816,13)","(4002817,19)","(4002820,2)","(4002825,17)","(4002854,7)",
>         |              |        |  "(4002855,13)","(4002855,14)","(4002869,14)","(4006217,3)","(4006217,11)",
>         |              |        |  "(4006307,11)","(4006330,4)","(4006330,8)","(4006331,6)","(4006332,9)",
>         |              |        |  "(4006332,10)","(4006333,9)","(4006335,2)","(4006338,1)","(4006342,7)",
>         |              |        |  "(4006343,5)","(4006407,2)","(4008912,2)"}
> (1 row)

I've used pg_filedump and pageinspect on all of the pages on the path to 
this entry, but I don't see anything wrong with the index.

So how did it end up like that? I'm afraid I have no clue :-(. Peter's 
hunch that it's some bad interaction between deleted pages and the 
pending list sounds plausible, but looking at the code, I don't see any 
concrete bugs like that.

We have one more lead that we haven't fully explored: amcheck reported 
this error:

ERROR:  index "entity_attribute_name_ids_gin": tid exceeds parent's high 
key in postingTree leaf on block 321746

Given that the GIN amcheck patch is still very experimental, that could 
well be a bug in amcheck instead of real corruption. Also, that error 
was *not* in the posting tree for key '1373'. I hacked amcheck on my 
laptop to run only on the posting tree of '1373', and it didn't find any 
errors there.

But let's check it out. I modified 'amcheck' to print more information. 
I also changed it so that it doesn't stop at the first error, but keeps 
going, and reports all issues as warnings.



Pawel: Could you try the attached version of amcheck? Please apply the 
attached patch (you'll have to un-apply the previous version or start 
from a clean source tree), run "SELECT 
gin_index_parent_check('entity_attribute_name_ids_gin');" again, and 
send over the result? It should print a bunch of INFOs and WARNINGs to 
the client's stderr, so a good way to capture that is something like this:

psql -c "select 
gin_index_parent_check('entity_attribute_name_ids_gin');" 2> amcheck-log


Let's see if that reveals something interesting. If not, what can we do 
next?

1. Let's fix the issues with the triconsistent functions that we 
discussed earlier. They don't explain what we're seeing here, but should 
be cleaned up nevertheless.

2. Let's add more sanity checks wherever we read a GIN page, to check 
that it's of the expected kind. Again we have no theory on what sequence 
of events could cause this, but having more sanity checks seems like a 
good idea in any case.

3. Run the test case again and wait for the issue to re-appear. This 
time, let's try to capture the WAL, so that we can trace back the 
modifications to the pages that led to the situation.
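
(For point 3, noting the WAL position before and after the issue
re-appears would bound how much WAL needs to be kept; a sketch:

   SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());

The segments between the two recorded positions would then be the
interesting ones.)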

Any other ideas?

- Heikki

Attachments
On Tue, Jul 20, 2021 at 1:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
[..]
>
>
> Pawel: Could you try the attached version of amcheck? Please apply the
> attached patch (you'll have to un-apply the previous version or start
> from a clean source tree), run "SELECT
> gin_index_parent_check('entity_attribute_name_ids_gin');" again, and
> send over the result? It should print a bunch of INFOs and WARNINGs to
> the client's stderr, so a good way to capture that is something like this:
>
> psql -c "select
> gin_index_parent_check('entity_attribute_name_ids_gin');" 2> amcheck-log
>

i'm attaching log of postgresql generated while running the command
above and amcheck-log.

thanks a lot!


-- 
regards,
Pawel Kudzia

Attachments

Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 20/07/2021 15:22, Pawel Kudzia wrote:
> On Tue, Jul 20, 2021 at 1:55 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Pawel: Could you try the attached version of amcheck? Please apply the
>> attached patch (you'll have to un-apply the previous version or start
>> from a clean source tree), run "SELECT
>> gin_index_parent_check('entity_attribute_name_ids_gin');" again, and
>> send over the result? It should print a bunch of INFOs and WARNINGs to
>> the client's stderr, so a good way to capture that is something like this:
>>
>> psql -c "select
>> gin_index_parent_check('entity_attribute_name_ids_gin');" 2> amcheck-log
>>
> 
> i'm attaching log of postgresql generated while running the command
> above and amcheck-log.
> 
> thanks a lot!

Great! So it's reporting two kinds of issues. First this:

WARNING:  index "entity_attribute_name_ids_gin": tid exceeds parent's 
high key in postingTree leaf on block 343573

I looked at the amcheck code again, and that is in fact bogus; there is 
nothing wrong with the high keys, but there is a memory management bug 
in the amcheck patch that's causing it.

The other error it complains about looks like this:

...
INFO:  key (1743225, 6) -> 363962
INFO:  key (1746295, 15) -> 341281
INFO:  key (1747712, 12) -> 343607
INFO:  key (1749215, 4) -> 376056
INFO:  key (262563, 1) -> 11547
INFO:  key (3906513777, 43269) -> 31653888
INFO:  key (536354808, 8196) -> 4432
ERROR:  index "entity_attribute_name_ids_gin" has wrong tuple order in 
posting tree, block 371806, offset 818

Looking at the amcheck code for that check, that's a false alarm too. 
amcheck is calculating the size of this array incorrectly, thinking that 
there are more entries on the page than there really are. The entries 
above up to (1749215, 4) are valid, but the rest are not real entries, 
just amcheck's misinterpretation. Unfortunately 'amcheck' stopped at 
that error again.

Sigh. I fixed those issues in amcheck again. And also added some more 
checks on the high keys. At least we're making the amcheck patch better, 
even if we're not any wiser about the original issue...

Pawel: Can you test once again with the attached amcheck version?

- Heikki

Attachments
On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
[..]
>
> Looking at the amcheck code for that check, that's a false alarm too.
> amcheck is calculating the size of this array incorrectly, thinking that
> there are more entries on the page than there really are. The entries
> above up to (1749215, 4) are valid, but the rest are not real entries,
> just amcheck's misinterpretation. Unfortunately 'amcheck' stopped at
> that error again.
>
> Sigh. I fixed those issues in amcheck again. And also added some more
> checks on the high keys. At least we're making the amcheck patch better,
> even if we're not any wiser about the original issue...
>
> Pawel: Can you test once again with the attached amcheck version?


patch v4-0001-Amcheck-for-GIN-13stable.patch applied to
postgresql-13_13.3.orig.tar.bz2

since the compressed logs are quite large - i've sent them directly to Heikki.

i've executing "select
gin_index_parent_check('entity_attribute_name_ids_gin');" took
considerably longer this time and it ended with :

[524767.920498] postgres[29716]: segfault at fffffffffffffffe ip
00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in
amcheck.so[7f214c209000+6000]
[524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49
89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46
06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7
43 04

>
> - Heikki



-- 
regards,
Pawel Kudzia



Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 20/07/2021 18:58, Pawel Kudzia wrote:
> On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Pawel: Can you test once again with the attached amcheck version?
> 
> patch v4-0001-Amcheck-for-GIN-13stable.patch applied to
> postgresql-13_13.3.orig.tar.bz2
> attached - log and result of the check.
> 
> executing "select
> gin_index_parent_check('entity_attribute_name_ids_gin');" took
> considerably longer this time and it ended with :
> 
> [524767.920498] postgres[29716]: segfault at fffffffffffffffe ip
> 00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in
> amcheck.so[7f214c209000+6000]
> [524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49
> 89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46
> 06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7
> 43 04

Hmph, another set of bugs in the amcheck patch. It didn't handle empty 
entry tree items, nor empty posting tree pages. And there was another 
bug, which caused it to scan entry tree pages multiple times; that's why 
the same WARNING was printed multiple times.

Fixed those bugs, new patch version attached. Pawel, can you test this 
again, please? At this point, I'm pretty sure this isn't going to reveal 
any more information about the original problem, but at least we're 
ironing out bugs from the 'amcheck' patch..

I'm grasping at straws here, but here's one more thing we could try: the 
query returned these incorrect tuples:

      ctid     | entity_id
--------------+-----------
  (4002784,1)  |  38048120
  (4002869,14) |  95333744
(2 rows)

We know those entries are in the GIN index with key '1373', when they 
shouldn't be. But I wonder if the correct keys for those tuples are 
present? Pawel, can you try this, please:

-- persuade planner to use a bitmap index scan
set enable_seqscan=off;
set enable_tidscan=off;
-- avoid lossy bitmaps
set work_mem='1GB';

-- Explain analyze first to check it's using a bitmap index scan and
-- no lossy pages.
explain analyze
SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && 
'{38048120}' and ctid='(4002784,1)';
SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && 
'{38048120}' and ctid='(4002784,1)';

explain analyze
SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && 
'{95333744}' and ctid='(4002869,14)';
SELECT ctid, entity_id FROM entity WHERE attribute_name_ids && 
'{95333744}' and ctid='(4002869,14)';


- Heikki

Attachments
On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 20/07/2021 18:58, Pawel Kudzia wrote:
> > On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> Pawel: Can you test once again with the attached amcheck version?
> >
> > patch v4-0001-Amcheck-for-GIN-13stable.patch applied to
> > postgresql-13_13.3.orig.tar.bz2
> > attached - log and result of the check.
> >
> > executing "select
> > gin_index_parent_check('entity_attribute_name_ids_gin');" took
> > considerably longer this time and it ended with :
> >
> > [524767.920498] postgres[29716]: segfault at fffffffffffffffe ip
> > 00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in
> > amcheck.so[7f214c209000+6000]
> > [524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49
> > 89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46
> > 06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7
> > 43 04
>
> Hmph, another set of bugs in the amcheck patch. It didn't handle empty
> entry tree items, nor empty posting tree pages. And there was another
> bug, which caused it to scan entry tree pages multiple times; that's why
> the same WARNING was printed multiple times.
>
> Fixed those bugs, new patch version attached. Pawel, can you test this
> again, please? At this point, I'm pretty sure this isn't going to reveal
> any more information about the original problem, but at least we're
> ironing out bugs from the 'amcheck' patch..

thank you for the next patch Heikki!
no crash this time! i'm sending the attachment in a separate mail since i'm
not sure if it'll get through the attachment size limit that's applied
to the list.

> I'm grasping at straws here, but here's one more thing we could try: the
> query returned these incorrect tuples:
>
>       ctid     | entity_id
> --------------+-----------
>   (4002784,1)  |  38048120
>   (4002869,14) |  95333744
> (2 rows)
>
> We know those entries are in the GIN index with key '1373', when they
> shouldn't be. But I wonder if the correct keys for those tuples are
> present? Pawel, can you try this, please:
>
> -- persuade planner to use a bitmap index scan
> set enable_seqscan=off;
> set enable_tidscan=off;
> -- avoid lossy bitmaps
> set work_mem='1GB';
>
> -- Explain analyze first to check it's using a bitmap index scan and
> -- no lossy pages.
> explain analyze
> SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> '{38048120}' and ctid='(4002784,1)';
> SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> '{38048120}' and ctid='(4002784,1)';
>
> explain analyze
> SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> '{95333744}' and ctid='(4002869,14)';
> SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> '{95333744}' and ctid='(4002869,14)';


data=# explain analyze SELECT ctid, entity_id FROM entity WHERE
attribute_name_ids && '{38048120}' and ctid='(4002784,1)';
                                                                 QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entity  (cost=72.70..3366.69 rows=1 width=14)
(actual time=42.995..42.998 rows=0 loops=1)
   Recheck Cond: (attribute_name_ids && '{38048120}'::integer[])
   Filter: (ctid = '(4002784,1)'::tid)
   ->  Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..72.70 rows=33055 width=0) (actual time=42.989..42.990
rows=0 loops=1)
         Index Cond: (attribute_name_ids && '{38048120}'::integer[])
 Planning Time: 18.912 ms
 Execution Time: 43.515 ms
(7 rows)

data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
'{38048120}' and ctid='(4002784,1)';
 ctid | entity_id
------+-----------
(0 rows)

data=# explain analyze SELECT ctid, entity_id FROM entity WHERE
attribute_name_ids && '{95333744}' and ctid='(4002869,14)';
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on entity  (cost=72.70..3366.69 rows=1 width=14)
(actual time=2.552..2.554 rows=0 loops=1)
   Recheck Cond: (attribute_name_ids && '{95333744}'::integer[])
   Filter: (ctid = '(4002869,14)'::tid)
   ->  Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..72.70 rows=33055 width=0) (actual time=2.543..2.544 rows=0
loops=1)
         Index Cond: (attribute_name_ids && '{95333744}'::integer[])
 Planning Time: 0.193 ms
 Execution Time: 2.594 ms
(7 rows)

data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
'{95333744}' and ctid='(4002869,14)';
 ctid | entity_id
------+-----------
(0 rows)




>
> - Heikki



-- 
regards,
Pawel Kudzia



On Thu, Jul 22, 2021 at 11:07 AM Pawel Kudzia <kudzia@gmail.com> wrote:
>
> On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 20/07/2021 18:58, Pawel Kudzia wrote:
> > > On Tue, Jul 20, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >> Pawel: Can you test once again with the attached amcheck version?
> > >
> > > patch v4-0001-Amcheck-for-GIN-13stable.patch applied to
> > > postgresql-13_13.3.orig.tar.bz2
> > > attached - log and result of the check.
> > >
> > > executing "select
> > > gin_index_parent_check('entity_attribute_name_ids_gin');" took
> > > considerably longer this time and it ended with :
> > >
> > > [524767.920498] postgres[29716]: segfault at fffffffffffffffe ip
> > > 00007f214c20a2d8 sp 00007ffe2a5fd040 error 5 in
> > > amcheck.so[7f214c209000+6000]
> > > [524767.920521] Code: 00 00 41 0f b7 4e 04 41 8b 56 0c 41 8b 76 10 49
> > > 89 c1 48 8d 04 40 c1 e1 10 45 0f b7 46 08 48 8d 7c 43 fa 41 0f b7 46
> > > 06 09 c1 <0f> b7 47 04 50 0f b7 07 0f b7 7f 02 c1 e0 10 09 f8 50 0f b7
> > > 43 04
> >
> > Hmph, another set of bugs in the amcheck patch. It didn't handle empty
> > entry tree items, nor empty posting tree pages. And there was another
> > bug, which caused it to scan entry tree pages multiple times; that's why
> > the same WARNING was printed multiple times.
> >
> > Fixed those bugs, new patch version attached. Pawel, can you test this
> > again, please? At this point, I'm pretty sure this isn't going to reveal
> > any more information about the original problem, but at least we're
> > ironing out bugs from the 'amcheck' patch..
>
> thank you for the next patch Heikki!
> no crash this time! i'm sending a message in a separate mail since i'm
> not sure if it'll pass through attachment size limit that's applied
> for the list.

there goes the attachment.


>
> > I'm grasping at straws here, but here's one more thing we could try: the
> > query returned these incorrect tuples:
> >
> >       ctid     | entity_id
> > --------------+-----------
> >   (4002784,1)  |  38048120
> >   (4002869,14) |  95333744
> > (2 rows)
> >
> > We know those entries are in the GIN index with key '1373', when they
> > shouldn't be. But I wonder if the correct keys for those tuples are
> > present? Pawel, can you try this, please:
> >
> > -- persuade planner to use a bitmap index scan
> > set enable_seqscan=off;
> > set enable_tidscan=off;
> > -- avoid lossy bitmaps
> > set work_mem='1GB';
> >
> > -- Explain analyze first to check it's using a bitmap index scan and
> > -- no lossy pages.
> > explain analyze
> > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> > '{38048120}' and ctid='(4002784,1)';
> > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> > '{38048120}' and ctid='(4002784,1)';
> >
> > explain analyze
> > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> > '{95333744}' and ctid='(4002869,14)';
> > SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> > '{95333744}' and ctid='(4002869,14)';
>
>
> data=# explain analyze SELECT ctid, entity_id FROM entity WHERE
> attribute_name_ids && '{38048120}' and ctid='(4002784,1)';
>                                                                  QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on entity  (cost=72.70..3366.69 rows=1 width=14)
> (actual time=42.995..42.998 rows=0 loops=1)
>    Recheck Cond: (attribute_name_ids && '{38048120}'::integer[])
>    Filter: (ctid = '(4002784,1)'::tid)
>    ->  Bitmap Index Scan on entity_attribute_name_ids_gin
> (cost=0.00..72.70 rows=33055 width=0) (actual time=42.989..42.990
> rows=0 loops=1)
>          Index Cond: (attribute_name_ids && '{38048120}'::integer[])
>  Planning Time: 18.912 ms
>  Execution Time: 43.515 ms
> (7 rows)
>
> data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> '{38048120}' and ctid='(4002784,1)';
>  ctid | entity_id
> ------+-----------
> (0 rows)
>
> data=# explain analyze SELECT ctid, entity_id FROM entity WHERE
> attribute_name_ids && '{95333744}' and ctid='(4002869,14)';
>                                                                 QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on entity  (cost=72.70..3366.69 rows=1 width=14)
> (actual time=2.552..2.554 rows=0 loops=1)
>    Recheck Cond: (attribute_name_ids && '{95333744}'::integer[])
>    Filter: (ctid = '(4002869,14)'::tid)
>    ->  Bitmap Index Scan on entity_attribute_name_ids_gin
> (cost=0.00..72.70 rows=33055 width=0) (actual time=2.543..2.544 rows=0
> loops=1)
>          Index Cond: (attribute_name_ids && '{95333744}'::integer[])
>  Planning Time: 0.193 ms
>  Execution Time: 2.594 ms
> (7 rows)
>
> data=# SELECT ctid, entity_id FROM entity WHERE attribute_name_ids &&
> '{95333744}' and ctid='(4002869,14)';
>  ctid | entity_id
> ------+-----------
> (0 rows)
>
>
>
>
> >
> > - Heikki
>
>
>
> --
> regards,
> Pawel Kudzia



-- 
regards,
Pawel Kudzia

Attachments

Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 22/07/2021 12:07, Pawel Kudzia wrote:
> On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Fixed those bugs, new patch version attached. Pawel, can you test this
>> again, please? At this point, I'm pretty sure this isn't going to reveal
>> any more information about the original problem, but at least we're
>> ironing out bugs from the 'amcheck' patch..
> 
> thank you for the next patch Heikki!
> no crash this time! i'm sending a message in a separate mail since i'm
> not sure if it'll pass through attachment size limit that's applied
> for the list.

Thanks! So looking at the log, amcheck is not reporting any more problems.

>> I'm grasping at straws here, but here's one more thing we could try: the
>> query returned these incorrect tuples:
>>
>>        ctid     | entity_id
>> --------------+-----------
>>    (4002784,1)  |  38048120
>>    (4002869,14) |  95333744
>> (2 rows)
>>
>> We know those entries are in the GIN index with key '1373', when they
>> shouldn't be. But I wonder if the correct keys for those tuples are
>> present? Pawel, can you try this, please:
> 
> [queries with no rows returned]

Ok, so the index is missing entries for the correct key. Looks like the 
index entries were inserted into the wrong subtree, under wrong key. But 
*how* did that happen? I'm out of ideas, I'm afraid :-(.

- Heikki



On Fri, Jul 23, 2021 at 3:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 22/07/2021 12:07, Pawel Kudzia wrote:
> > On Thu, Jul 22, 2021 at 9:25 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> Fixed those bugs, new patch version attached. Pawel, can you test this
> >> again, please? At this point, I'm pretty sure this isn't going to reveal
> >> any more information about the original problem, but at least we're
> >> ironing out bugs from the 'amcheck' patch..
> >
> > thank you for the next patch Heikki!
> > no crash this time! i'm sending a message in a separate mail since i'm
> > not sure if it'll pass through attachment size limit that's applied
> > for the list.
>
> Thanks! So looking at the log, amcheck is not reporting any more problems.
>
> >> I'm grasping at straws here, but here's one more thing we could try: the
> >> query returned these incorrect tuples:
> >>
> >>        ctid     | entity_id
> >> --------------+-----------
> >>    (4002784,1)  |  38048120
> >>    (4002869,14) |  95333744
> >> (2 rows)
> >>
> >> We know those entries are in the GIN index with key '1373', when they
> >> shouldn't be. But I wonder if the correct keys for those tuples are
> >> present? Pawel, can you try this, please:
> >
> > [queries with no rows returned]
>
> Ok, so the index is missing entries for the correct key. Looks like the
> index entries were inserted into the wrong subtree, under wrong key. But
> *how* did that happen? I'm out of ideas, I'm afraid :-(.

Thanks a lot for your patience and multiple patches that you've
provided. Please pardon my ignorance - I don't have low-level
understanding of how the query is being executed - but are you sure
that the index is missing entries and not the other way around - that it
has too many entries?

To recap - SELECT, answered based on the GIN, reports rows that
actually do not match the criteria provided in WHERE. Just lowering
work_mem makes the problem go away, with GIN still being used.

Greetings!




-- 
regards,
Pawel Kudzia



Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 23 July 2021 18:04:50 EEST, Pawel Kudzia <kudzia@gmail.com> wrote:
>On Fri, Jul 23, 2021 at 3:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Ok, so the index is missing entries for the correct key. Looks like the
>> index entries were inserted into the wrong subtree, under wrong key. But
>> *how* did that happen? I'm out of ideas, I'm afraid :-(.

Can you tell more about the workload? What INSERT/UPDATE/DELETE commands do you run? How many concurrent clients? Do you run vacuum manually or how often does autovacuum kick in? How long did it take for the problem to appear? What queries did you run to find the corruption, and how often?

I know it's a big ask, but would it be possible to simplify the test case further, to make it reproduce faster? Ideally as a self-contained test script with any sensitive data removed.

- Heikki



Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Heikki Linnakangas
Date:
On 23/07/2021 18:04, Pawel Kudzia wrote:
> Thanks a lot for your patience and multiple patches that you've
> provided. Please pardon my ignorance - I don't have low-level
> understanding of how the query is being executed - but are you sure
> that the index is missing entries and not the other way around - that it
> has too many entries?

To be precise, the index has an extra entry for row (4002784,1) with key 
1373, and it's missing the entry with key 38048120. And for row 
(4002869,14), it has an extra row for key 1373, and it's missing the 
entry for key 95333744.

> To recap - SELECT, answered based on the GIN, reports rows that
> actually do not match the criteria provided in WHERE. Just lowering
> work_mem makes the problem go away, with GIN still being used.

The reason that lowering work_mem hides the problem is that GIN collects 
all the matches in a so called TID bitmap, and if the bitmap grows too 
large compared to work_mem, it becomes lossy to save memory. When it's 
lossy, it only stores the heap block numbers of the matches. For a 
regular, non-lossy, match, the Bitmap Heap Scan just returns the row 
that the index says is a match. For the lossy matches, the Bitmap Heap 
Scan node needs to check every row on the page to see which ones 
actually match. This re-checking hides the problem that some of the 
matches that the index reported were not real.
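
To see that effect directly, one can run the same query under a large and a small work_mem and compare the "Heap Blocks: exact=... lossy=..." line of the Bitmap Heap Scan and whether the bogus rows appear. A minimal sketch (the key 1373 and the exact settings are only illustrative):

-- sketch: same query under two work_mem settings; values are examples only
set enable_seqscan=off;

set work_mem='1GB';      -- bitmap stays exact; index entries are trusted as-is
explain analyze
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1373}' )
AND NOT ( (attribute_name_ids||0) && '{1373}');

set work_mem='256kB';    -- bitmap may turn lossy; every row on a lossy page is rechecked
explain analyze
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1373}' )
AND NOT ( (attribute_name_ids||0) && '{1373}');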

Note that you are also getting incorrect results, with missing rows, for 
other queries. You can try it with e.g.:

-- Using the index
set enable_seqscan=off;
set enable_bitmapscan=on;
SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}';
SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}';

-- Without index
set enable_seqscan=on;
set enable_bitmapscan=off;
SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}';
SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}';


I'll work on a patch to add more sanity checks to the GIN code when it 
traverses the tree, to catch the case that it accidentally steps on a 
wrong kind of a page (I'm pretty busy next week, so might not get to 
that until the week after though). I don't think that will help here, 
but who knows, and at least we can rule out some kinds of bugs. 
Alexander, can you finish the fixes to the ternary logic? That doesn't 
explain this corruption either, but we should fix it anyway.

- Heikki



On Fri, Jul 23, 2021 at 11:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 23 July 2021 18:04:50 EEST, Pawel Kudzia <kudzia@gmail.com> wrote:
> >On Fri, Jul 23, 2021 at 3:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> Ok, so the index is missing entries for the correct key. Looks like the
> >> index entries were inserted into the wrong subtree, under wrong key. But
> >> *how* did that happen? I'm out of ideas, I'm afraid :-(.
>
> Can you tell more about the workload? What INSERT/UPDATE/DELETE commands do you run? How many concurrent clients? Do you run vacuum manually or how often does autovacuum kick in? How long did it take for the problem to appear? What queries did you run to find the corruption, and how often?
>

It's a mix with a low percentage of simple INSERT INTO entity [
inserting a single row within a transaction ], DELETE FROM entity [
deleting a single row selected by primary key ] and a high number of
UPDATE entity SET .. WHERE primname=?.
There are a few million such updates a day. A single transaction has just
one such update; the update sets most of the table columns, although in
fact most of their values remain unchanged.

Concurrency level  - less than 50 parallel connections.

Vacuum is run manually extremely rarely [ maybe 2-3x a year ].

Based on recent logs - autovacuum happens every ~ 7 days.
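
(For reference, one way to confirm that cadence from the statistics views - a sketch using the table name from this thread:)

-- sketch: last (auto)vacuum / (auto)analyze times for the table
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       vacuum_count, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'entity';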

The problem was discovered by coincidence - one of our users noticed incorrect
results being returned, which prompted me to run a script that hammered the
database every hour with a set of queries of this form:

SELECT primname FROM entity WHERE ( attribute_name_ids && '{{$ani}}' )
AND NOT ( (attribute_name_ids||0) && '{{$ani}}')

It would discover ~ 1-5 problematic entries each week, but there were
periods of a few weeks where no new problems were found [ note that
this does not mean there were no such problems; I was hammering only
one of the GIN columns - the one with the smallest universe of values
to iterate over ].
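
A self-contained psql sketch of such a check (the range 1..5000 below is only a placeholder for the real universe of attribute_name_ids values):

-- sketch: generate one "should never return rows" probe per candidate key
-- and let psql execute them all; the id range is a placeholder
SELECT format(
  $q$SELECT %1$s AS ani, entity_id FROM entity
       WHERE attribute_name_ids && '{%1$s}'
         AND NOT ((attribute_name_ids || 0) && '{%1$s}')$q$,
  ani)
FROM generate_series(1, 5000) AS ani
\gexec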



> I know it's a big ask, but would it be possible to simplify the test case further, to make it reproduce faster? Ideally as a self-contained test script with any sensitive data removed.
>

Sadly - I've put quite a lot of effort into reproducing the issue in a test
environment and was never able to do it. The problem occurred only in
production, after prolonged exposure to live update traffic.
Note that in the meantime I've moved the database to another physical server to
eliminate potential hardware malfunction - that did not help, the problem
continued to occur. I've also upgraded from 12.5 to 13.3.

I no longer have a production setup that uses an index built in this way:

CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids public.gin__int_ops);

Two weeks ago I moved to

CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids);

and have not seen the problem manifest again.
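
(For anyone who needs to make the same switch on a busy table, a sketch of a non-blocking variant - the _new suffix is just a placeholder:)

-- sketch: swap the opclass without blocking writes
CREATE INDEX CONCURRENTLY entity_attribute_name_ids_gin_new
    ON public.entity USING gin (attribute_name_ids);
DROP INDEX CONCURRENTLY public.entity_attribute_name_ids_gin;
ALTER INDEX entity_attribute_name_ids_gin_new
    RENAME TO entity_attribute_name_ids_gin;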

Nevertheless - I have 2 file-level dumps of databases where we caught
the inconsistency and would like to help with diagnosing the problem.



--
regards,
Pawel Kudzia



On Sun, Jul 25, 2021 at 9:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 23/07/2021 18:04, Pawel Kudzia wrote:
> > Thanks a lot for your patience and multiple patches that you've
> > provided. Please pardon my ignorance - I don't have low-level
> > understanding of how the query is being executed - but are you sure
> > that the index is missing entries and not the other way around - that it
> > has too many entries?
>
> To be precise, the index has an extra entry for row (4002784,1) with key
> 1373, and it's missing the entry with key 38048120. And for row
> (4002869,14), it has an extra row for key 1373, and it's missing the
> entry for key 95333744.
>
> > To recap - SELECT, answered based on the GIN, reports rows that
> > actually do not match the criteria provided in WHERE. Just lowering
> > work_mem makes the problem go away, with GIN still being used.
>
> The reason that lowering work_mem hides the problem is that GIN collects
> all the matches in a so called TID bitmap, and if the bitmap grows too
> large compared to work_mem, it becomes lossy to save memory. When it's
> lossy, it only stores the heap block numbers of the matches. For a
> regular, non-lossy, match, the Bitmap Heap Scan just returns the row
> that the index says is a match. For the lossy matches, the Bitmap Heap
> Scan node needs to check every row on the page to see which ones
> actually match. This re-checking hides the problem that some of the
> matches that the index reported were not real.

Thank you for the explanation!


>
> Note that you are also getting incorrect results, with missing rows, for
> other queries. You can try it with e.g.:
>
> -- Using the index
> set enable_seqscan=off;
> set enable_bitmapscan=on;
> SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}';
> SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}';
>
> -- Without index
> set enable_seqscan=on;
> set enable_bitmapscan=off;
> SELECT count(*) FROM entity WHERE attribute_name_ids && '{95333744}';
> SELECT count(*) FROM entity WHERE attribute_name_ids && '{38048120}';
>
>

Actually - both give identical results - count(*) = 0.

I think you actually wanted me to run those:

data=# set enable_seqscan=off;
SET
data=# set enable_bitmapscan=on;
SET
data=# SELECT count(*) FROM entity WHERE attribute_name_ids && '{1737}';
 count
-------
 79565
(1 row)

data=# set enable_seqscan=on;
SET
data=# set enable_bitmapscan=off;
SET
data=# SELECT count(*) FROM entity WHERE attribute_name_ids && '{1737}';
 count
-------
 79560
(1 row)

Results indeed differ.
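
One way to pin down exactly which rows the two scan paths disagree on is to capture both result sets and diff them - a sketch (same session; ctid is cast to text only so the sets can be compared, and concurrent updates between the two snapshots can add noise):

-- sketch: capture both result sets for the same key and diff them
set enable_seqscan=off; set enable_bitmapscan=on;
CREATE TEMP TABLE via_index AS
  SELECT ctid::text AS tid, entity_id FROM entity
  WHERE attribute_name_ids && '{1737}';

set enable_seqscan=on; set enable_bitmapscan=off;
CREATE TEMP TABLE via_heap AS
  SELECT ctid::text AS tid, entity_id FROM entity
  WHERE attribute_name_ids && '{1737}';

-- rows the index scan returned that the sequential scan did not
SELECT * FROM via_index EXCEPT SELECT * FROM via_heap;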

> I'll work on a patch to add more sanity checks to the GIN code when it
> traverses the tree, to catch the case that it accidentally steps on a
> wrong kind of a page (I'm pretty busy next week, so might not get to
> that until the week after though). I don't think that will help here,
> but who knows, and at least we can rule out some kinds of bugs.
> Alexander, can you finish the fixes to the ternary logic? That doesn't
> explain this corruption either, but we should fix it anyway.
>

Thanks a lot! I'm happy to test.

Greetings!


-- 
regards,
Pawel Kudzia



Just to make sure: are you using any kind of storage or other virtualization, or a SAN? And what is your approach to backups? 

Peter Geoghegan
(Sent from my phone)
On Sun, Jul 25, 2021 at 10:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Just to make sure: are you using any kind of storage or other virtualization, or a SAN? And what is your approach to backups?
>
> Peter Geoghegan
> (Sent from my phone)

It's locally attached storage:

2x NVMe installed in the server
on them mdadm RAID1
on that LUKS full disk encryption
on that ext4

Please note that exactly the same storage / partition also holds MySQL
and Sphinxsearch databases - all of those run on the same server, under
different users. All of them [ PG, MySQL, Sphinx ] are under heavy
read traffic and update traffic comparable to the one described above for
PostgreSQL - there are no consistency issues in the Sphinx or MySQL
databases.

PostgreSQL backups are done with pg_dumpall every 24h; we also use
streaming replication, so the servers where inconsistencies were found are
also masters for 5 replicas.
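
(A quick sketch, if it helps, of how one might confirm on the master that all five replicas are streaming and how far behind they are:)

-- sketch: replication status as seen from the master
SELECT application_name, client_addr, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;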




-- 
regards,
Pawel Kudzia



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Pawel Kudzia
Date:
On Thu, Jun 17, 2021 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Pawel Kudzia <kudzia@gmail.com> writes:
> > with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> > or less makes the problem go away:
>
> Hmm.  So that suggests that the index itself is *not* corrupt,
> but the problem is associated with a bug in the indexscan
> algorithms.
>
> Have you experimented with different index opclasses?  Your
> original report said you were using gin__int_ops, but that's
> not the default, and we sort of deprecate intarray these days.

Hello,

Three weeks ago we've changed from

CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids public.gin__int_ops);

into

CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids);

Sadly - the problems are back.

I have:
* queries that return incorrect responses:
SET enable_seqscan = OFF;
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1485}' )
AND NOT ( (attribute_name_ids||0) && '{1485}') LIMIT 10;
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{144}' )
AND NOT ( (attribute_name_ids||0) && '{144}') LIMIT 10;
SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' )
AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10;

all three return:
 entity_id |      primname
-----------+--------------------
  99311962 | msci_674591ltepsgt
  99312880 | msci_674667mgu

data=# explain analyze SELECT entity_id FROM entity WHERE (
attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) &&
'{2652}') LIMIT 10;
                                                                     QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=30.59..31.71 rows=10 width=8) (actual
time=149.988..200.749 rows=2 loops=1)
   ->  Bitmap Heap Scan on entity  (cost=30.59..3798.81 rows=33644
width=8) (actual time=149.985..200.744 rows=2 loops=1)
         Recheck Cond: (attribute_name_ids && '{2652}'::integer[])
         Rows Removed by Index Recheck: 79741
         Filter: (NOT ((attribute_name_ids || 0) && '{2652}'::integer[]))
         Rows Removed by Filter: 16723
         Heap Blocks: exact=8277 lossy=9989
         ->  Bitmap Index Scan on entity_attribute_name_ids_gin
(cost=0.00..30.59 rows=33813 width=0) (actual time=9.712..9.713
rows=37367 loops=1)
               Index Cond: (attribute_name_ids && '{2652}'::integer[])
 Planning Time: 0.472 ms
 Execution Time: 200.977 ms
(11 rows)


* the first time the hourly consistency check reported the inconsistency
[ 2021-08-01 19:04:15 for entity_id=99312880, 2021-08-01 23:04:15 for
entity_id=99311962 ]
* timestamps when the affected rows have been updated [ entity_id=99312880:
2021-08-01 16:51:59, 2021-08-01 19:04:06 [ most likely that update led
to the inconsistency ], 2021-08-01 23:04:14, 2021-08-02 01:08:18,
2021-08-02 05:12:15; entity_id=99311962: 2021-08-01 16:51:59,
2021-08-01 19:04:05, 2021-08-01 23:04:13 [ likely culprit ],
2021-08-02 01:08:16 ]
* WAL files from the **master server** from the time when the
inconsistency started to occur [ from 2021-08-01 14:50:37 till
2021-08-02 04:31:00; sadly I don't have enough to cover the whole
period between the snapshots mentioned below ]
* file-level snapshots of the database taken from a streaming
**replication slave** before the problem started to occur [ 2021-08-01
12:00:01 ] and after it occurred [ 2021-08-02 00:00:01 ]

I'll be happy to run any diagnostics needed and to provide access to
any particular WAL file.
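
For anyone digging into those WAL segments: pg_waldump can limit the dump to GIN records with --rmgr=Gin, and the records can be tied back to this particular index via its relfilenode - a sketch (index name from this thread, everything else generic):

-- sketch: find the relfilenode / on-disk path of the suspect index, so its
-- records can be picked out of pg_waldump output ("rel <ts>/<db>/<relfilenode>")
SELECT oid, relfilenode, pg_relation_filepath(oid::regclass)
FROM pg_class
WHERE relname = 'entity_attribute_name_ids_gin';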

Greetings!

-- 
regards,
Pawel Kudzia



Re: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Pawel Kudzia
Date:
On Mon, Aug 2, 2021 at 10:07 AM Pawel Kudzia <kudzia@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Pawel Kudzia <kudzia@gmail.com> writes:
> > > with help from IRC we've found that decreasing work_mem from 1MB to 256kB
> > > or less makes the problem go away:
> >
> > Hmm.  So that suggests that the index itself is *not* corrupt,
> > but the problem is associated with a bug in the indexscan
> > algorithms.
> >
> > Have you experimented with different index opclasses?  Your
> > original report said you were using gin__int_ops, but that's
> > not the default, and we sort of deprecate intarray these days.
>
> Hello,
>
> Three weeks ago we've changed from
>
> CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
> (attribute_name_ids public.gin__int_ops);
>
> into
>
> CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
> (attribute_name_ids);
>
> Sadly - the problems are back.
>
> I have:
> * queries that return incorrect responses:
> SET enable_seqscan = OFF;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{1485}' )
> AND NOT ( (attribute_name_ids||0) && '{1485}') LIMIT 10;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{144}' )
> AND NOT ( (attribute_name_ids||0) && '{144}') LIMIT 10;
> SELECT entity_id FROM entity WHERE ( attribute_name_ids && '{2652}' )
> AND NOT ( (attribute_name_ids||0) && '{2652}') LIMIT 10;
>
> all three return:
>  entity_id |      primname
> -----------+--------------------
>   99311962 | msci_674591ltepsgt
>   99312880 | msci_674667mgu
>
> data=# explain analyze SELECT entity_id FROM entity WHERE (
> attribute_name_ids && '{2652}' ) AND NOT ( (attribute_name_ids||0) &&
> '{2652}') LIMIT 10;
>                                                                      QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=30.59..31.71 rows=10 width=8) (actual
> time=149.988..200.749 rows=2 loops=1)
>    ->  Bitmap Heap Scan on entity  (cost=30.59..3798.81 rows=33644
> width=8) (actual time=149.985..200.744 rows=2 loops=1)
>          Recheck Cond: (attribute_name_ids && '{2652}'::integer[])
>          Rows Removed by Index Recheck: 79741
>          Filter: (NOT ((attribute_name_ids || 0) && '{2652}'::integer[]))
>          Rows Removed by Filter: 16723
>          Heap Blocks: exact=8277 lossy=9989
>          ->  Bitmap Index Scan on entity_attribute_name_ids_gin
> (cost=0.00..30.59 rows=33813 width=0) (actual time=9.712..9.713
> rows=37367 loops=1)
>                Index Cond: (attribute_name_ids && '{2652}'::integer[])
>  Planning Time: 0.472 ms
>  Execution Time: 200.977 ms
> (11 rows)
>
>
> * the first time the hourly consistency check reported the inconsistency
> [ 2021-08-01 19:04:15 for entity_id=99312880, 2021-08-01 23:04:15 for
> entity_id=99311962 ]
> * timestamps when the affected rows have been updated [ entity_id=99312880:
> 2021-08-01 16:51:59, 2021-08-01 19:04:06 [ most likely that update led
> to the inconsistency ], 2021-08-01 23:04:14, 2021-08-02 01:08:18,
> 2021-08-02 05:12:15; entity_id=99311962: 2021-08-01 16:51:59,
> 2021-08-01 19:04:05, 2021-08-01 23:04:13 [ likely culprit ],
> 2021-08-02 01:08:16 ]
> * WAL files from the **master server** from the time when the
> inconsistency started to occur [ from 2021-08-01 14:50:37 till
> 2021-08-02 04:31:00; sadly I don't have enough to cover the whole
> period between the snapshots mentioned below ]
> * file-level snapshots of the database taken from a streaming
> **replication slave** before the problem started to occur [ 2021-08-01
> 12:00:01 ] and after it occurred [ 2021-08-02 00:00:01 ]
>
> I'll be happy to run any diagnostics needed and to provide access to
> any particular WAL file.
>

Hello,

We continue to run into this issue and we still do not have any
reliable method to reproduce it. Is there any diagnostic information
that we could provide to make it easier to resolve?

Thank you!


-- 
regards,
Pawel Kudzia



Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

From
Peter Geoghegan
Date:
On Sun, Jul 25, 2021 at 12:08 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'll work on a patch to add more sanity checks to the GIN code when it
> traverses the tree, to catch the case that it accidentally steps on a
> wrong kind of a page (I'm pretty busy next week, so might not get to
> that until the week after though). I don't think that will help here,
> but who knows, and at least we can rule out some kinds of bugs.

I just noticed that among call sites to ginInsertCleanup() located in
ginvacuum.c, only one caller specifies fill_fsm=true: The call made
when an autoanalyze (i.e. av worker that just does an ANALYZE) is run.
In other words, a call through the special analyze_only path [1].
Note, in particular, that a plain VACUUM, or a plain VACUUM ANALYZE
will always specify fill_fsm=false, regardless of any of the details.
This seems totally contradictory to me.

(***Thinks some more***)

Actually, that's not quite right -- it's not contradictory once you
realize that fill_fsm=true is an extra special path, originally
designed just for the analyze_only path. That's easy to see once you
look at commit dc943ad952 from 2015. Its specific goal was to allow
this special analyze_only path to recycle pages. We need fill_fsm=true
because the analyze_only path won't scan the whole index again at the
end of ginvacuumcleanup(). (Actually, why even scan it in the similar
cleanup-only VACUUM case? Ugh, that's not worth getting into now.)

In summary: Although I always suspected the fill_fsm=true ShiftList()
path here, it's even easier to suspect it now. Because I see now that
it hardly ever gets used inside autovacuum worker processes, though of
course it is possible. It's probably a matter of luck as to whether
you hit the analyze_only + fill_fsm=true ShiftList() path. That might
well have contributed to our difficulty in recreating the bugs here.

This is such a mess. I'm not sure that this insight will be news to
you, Heikki. Perhaps I'm just venting about the confused state of the
code here, following a recent unpleasant reminder (again, see [1]).

[1] https://postgr.es/m/CAH2-WzkjrK556enVtFLmyXEdw91xGuwiyZVep2kp5yQT_-3JDg@mail.gmail.com

--
Peter Geoghegan