Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

Поиск
Список
Период
Сортировка
От Pawel Kudzia
Тема Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Дата
Msg-id CAJYBUS-DHFWJLAYLRSsJ=ZpEOvZNnL=VQd2puPQvhaKiJWvheQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-bugs
On Fri, Jul 23, 2021 at 11:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 23 July 2021 18:04:50 EEST, Pawel Kudzia <kudzia@gmail.com> wrote:
> >On Fri, Jul 23, 2021 at 3:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> Ok, so the index is missing entries for the correct key. Looks like the
> >> index entries were inserted into the wrong subtree, under wrong key. But
> >> *how* did that happen? I'm out of ideas, I'm afraid :-(.
>
> Can you tell more about the workload? What INSERT/UPDATE/DELETE commands do you run? How many concurrent clients? Do
yourun vacuum manually or how often does autovacuum kick in? How long did it take for the problem to appear? What
queriesdid you run to find the corruption, and how often? 
>

It's a mix with low percentage of simple INSERT INTO entity [
inserting single row within a transaction ], DELETE FROM entity [
deleting single row, selected by primary key ] and high number of
UPDATE entity SET .. WHERE primname=?.
There are few millions such updates a day. Single transaction has just
one such update, update sets most of the table columns although in
fact most of their values remain unchanged.

Concurrency level  - less than 50 parallel connections.

Vacuum is run manually extremely rarely [ maybe 2-3x a year ].

Based on recent logs - autovacuum happens every ~ 7 days.

Problem was discovered by coincidence - one of users noticed incorrect
results returned, this provoked me to run a script that hammered
database every hour with set of queries in this form:

SELECT primname FROM entity WHERE ( attribute_name_ids && '{{$ani}}' )
AND NOT ( (attribute_name_ids||0) && '{{$ani}}')

It would discover ~ 1-5 problematic entries each week, but there were
periods of few weeks where new problems were not found [ note that
this does not mean there were no such problems, i was hammering only
one of GIN columns - where i had smallest universe of values to
iterate over ].



> I know it's a big ask, but would it be possible to simplify the test case further, to make it reproduce faster?
Ideallyas a self-contained test script with any sensitive data removed. 
>

Sadly - I've put quite a lot of effort to reproduce the issue in test
environment and was never able to do it. Problem occured on the
production only, after prolonged execution of live update traffic.
Note that in the I've moved database to another physical server to
eliminate potential hardware malfunction - that did not help, problem
continued to occur. I've also upgraded from 12.5 to 13.3.

I no longer have a production setup that uses index built in this way:

CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids public.gin__int_ops);

2 weeks ago i've moved to

CREATE INDEX entity_attribute_name_ids_gin ON public.entity USING gin
(attribute_name_ids);

And have not seen the problem manifest again.

Nevertheless - I have 2 file-level dumps of databases where we caught
the inconsistency and would like to help with diagnosing the problem.



--
regards,
Pawel Kudzia



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Следующее
От: Pawel Kudzia
Дата:
Сообщение: Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows