IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

Поиск
Список
Период
Сортировка
On 17/06/2021 22:49, Tom Lane wrote:
> I wrote:
>> Pawel Kudzia <kudzia@gmail.com> writes:
>>> with help from IRC we've found that decreasing work_mem from 1MB to 256kB
>>> or less makes the problem go away:
> 
>> Hmm.  So that suggests that the index itself is *not* corrupt,
>> but the problem is associated with a bug in the indexscan
>> algorithms.
> 
> After staring at the code I think there is at least one bug, and
> possibly two, in shimTriConsistentFn.  That's likely to be implicated
> here because intarray's GIN opclass only provides a bool consistent
> function.  I'm not very clear on the circumstances that lead to the scan
> code inventing GIN_MAYBE inputs, so I haven't been able to construct a
> test case.
> 
> The definite bug is triggered because intarray relies on the API
> specification that says that if the search mode is
> GIN_SEARCH_MODE_DEFAULT, then the consistentFn will only be called
> when there's at least one TRUE key:
> 
>          case RTOverlapStrategyNumber:
>              /* result is not lossy */
>              *recheck = false;
>              /* at least one element in check[] is true, so result = true */
>              res = true;
>              break;
> 
> shimTriConsistentFn ignores this and calls it on all-FALSE inputs,
> for which it'll incorrectly get a TRUE result, as it will also for
> all the following checks.  The upshot is that shimTriConsistentFn
> will convert any case with a MAYBE input to a hard TRUE result,
> with no recheck requirement.  This'd easily explain the reported
> misbehavior.

That's subtle. The documentation says:

>        If <literal>*searchMode</literal> is set to
>        <literal>GIN_SEARCH_MODE_DEFAULT</literal> (which is the value it is
>        initialized to before call), only items that match at least one of
>        the returned keys are considered candidate matches.

I guess you can read that as "consistentFn will only be called when 
there is at least one matching item", but that didn't occur to me when I 
read that at first. But I agree we need to fix shimTriConsistentFn to 
respect that.

> The other thing that I'm unsure about, because the code is horribly
> underdocumented, is that it's not very clear whether
> shimTriConsistentFn is accurately converting between the bool and
> the tristate APIs.  That's because it's not very clear what the
> tristate API actually *is*.  In particular, is the end state of
> key->recheckCurItem meaningful in the tristate case?  If it's not,
> then the short-circuit case for no MAYBE inputs is broken, because
> it'll return TRUE when the bool consistentFn is trying to tell us
> to recheck.  But if it is meaningful, then the multiway case is broken,
> because it will return the recheckCurItem result set by the last call to
> the bool consistentfn; which might be false even though other calls said
> true.  (So in that case I think we'd need "key->recheckCurItem = recheck"
> at the end.)  I also wonder how any of that works correctly for real
> triconsistent functions, which don't have access to the recheckCurItem
> flag.
> 
> Anyway, I'm punting this to the code authors.  I'd like to see
> some comments clarifying what the API really is, not just a
> quick-n-dirty code fix.

I've been trying to create a test case for this, but no luck so far. I 
cannot find a codepath that would hit these bugs with the kind of query 
that Pawel used. The search condition is very simple: "column && 
'{constant}'", with only one key and one constant to search for. There 
are three calls to triConsistentFn:

1. In startScanKey, but that's only reached if (key->nentries > 1), so 
not reachable with Pawel's query.

2. In keyGetItem, when haveLossyEntry==true. But AFAICS, haveLossyEntry 
is never true with Pawel's query. That would require curItem to be a 
lossy pointer, and that in turn can only happen when matchBitmap is 
used, and that's only used with partial match queries and with queries 
that need a full-index scan.

3. The second call in keyGetItem is reached, but it is only passed 
GIN_MAYBE when curItem is lossy. Which isn't possible with this query, 
see point 2.

AFAICS the recheckCurItem bug should also not cause wrong results with 
that query. I must be missing something.

I could probably construct a test case with a different query, one 
involving multiple elements in the search key, but I'd like to have a 
solid explanation for the original report.

Pawel, is this a production system, or would it be possible for you to 
build from sources with a patch with some extra debugging output?

- Heikki



В списке pgsql-bugs по дате отправления:

Предыдущее
От: dhanabakeeswari v
Дата:
Сообщение: Re: BUG #17060: ERROR: column "rownum" does not exist
Следующее
От: Pawel Kudzia
Дата:
Сообщение: Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows