Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

Поиск
Список
Период
Сортировка
От Pawel Kudzia
Тема Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Дата
Msg-id CAJYBUS9_VUUYfAen+_pWGa4xxRR-2uvX+_Ycg6W+NqBXVwLY2Q@mail.gmail.com
обсуждение исходный текст
Ответ на Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-bugs


On Thu, Jul 15, 2021 at 8:49 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 15/07/2021 12:38, Pawel Kudzia wrote:
>>> How can I identify relevant pages to make selective dump of them for
>>> you, using method proposed by Peter?
>>
>> It's a bit hard without knowing which index pages are affected. But the
>> attached script (dump-gin-page-info.sql) dumps some summary information
>> about the kind of pages there are. That's a start.
>
> I'm attaching output of the proposed commands.

Thanks!

>> Let's try one thing first: I rebased a patch to add GIN support to the
>> amcheck extension [1]. I kind of doubt that it will catch the corruption
>> in this case, but you never know. So please apply the attached
>> v2-0001-Amcheck-for-GIN-13stable.patch patch. It is the same patch I
>> posted at [1], but the installation script is slightly different to make
>> it work on PostgreSQL v13. Then compile install it:
>>
>>     cd contrib/amcheck; make install
>>
>> Then in psql:
>>
>>     CREATE EXTENSION amcheck;
>>     gin_index_parent_check('entity_attribute_name_ids_gin');
>
> Sadly i'm getting this when executing CREATE EXTENSION amcheck;
>
> ERROR:  extension "amcheck" has no installation script nor update path
> for version "1.2.1"

Hmm, that's odd, it worked for me. Did "make install" copy the
"amcheck--1.2--1.2.1.sql" file to the right place? It should be
installed the same directory as "amcheck.control".



thank you for the hint. i've messed up patching. now i can get one step further:

data=# CREATE EXTENSION amcheck;
CREATE EXTENSION
data=# gin_index_parent_check('entity_attribute_name_ids_gin');
ERROR:  syntax error at or near "gin_index_parent_check"
LINE 1: gin_index_parent_check('entity_attribute_name_ids_gin');
        ^

i've also applied trace-gin-readbuffer-2.patch

 
Anyway, let's try a different tack. Here's another debugging patch. It
causes a LOG message to be printed whenever a GIN index page is
accessed, like this:

2021-07-15 21:22:20.119 EEST [972708] LOG:  ReadBuffer 1663/12678/27035
read gin blk 0 (ginget.c:1832 scanPendingInsert)
2021-07-15 21:22:20.119 EEST [972708] STATEMENT:  explain analyze select
* from test_intarray where a @> '{2}';
2021-07-15 21:22:20.119 EEST [972708] LOG:  ReadBuffer 1663/12678/27035
read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
2021-07-15 21:22:20.119 EEST [972708] STATEMENT:  explain analyze select
* from test_intarray where a @> '{2}';

Please apply this patch, run the query again, and copy the resulting
log.

that'll be the file called "log", send directly to your e-mail address.
 
I'm also interested in the contents of the pages that are accessed,
so if you could then run these commands to dump the contents of those
index pages:

# extract just the block numbers from the log
cat <path to postgres log> | perl -ne '/.*read gin blk (\d+)/ && print
"$1\n" ' | sort -n |uniq > /tmp/blocknums

# dump contents of each of those index pages to '/tmp/block-contents'
psql postgres -c "create temp table blocknums(blknum int); copy
blocknums from '/tmp/blocknums'; copy (select blknum,
get_raw_page('entity_attribute_name_ids_gin', blknum) from blocknums) to
'/tmp/block-contents';"

done, side note - i had to run "create extension pageinspect" and provide name of my database - "data" rather than "postgres".


Please send over the resulting PostgreSQL log, and
'/tmp/block-contents', and I'll take a closer look into the contents to
see if there's something funny. (A private email is OK if you don't feel
comfortable sharing it with the world).

One more thing, please also add the 'ctid' virtual column to the query,
like this:

SELECT ctid, entity_id FROM entity WHERE
( attribute_name_ids && '{1737}' ) AND NOT ( (attribute_name_ids||0)
&& '{1737}') LIMIT 10;

that's how the queries above were run. output:

     ctid     | entity_id
--------------+-----------
 (4002784,1)  |  38048120
 (4002869,14) |  95333744
(2 rows)

 
Also, did you happen to keep the WAL log from around the time that the
query started producing wrong results? That would also be useful
evidence for how it got into this situation.

sadly i don't. as noted earlier - we've already switched the production setup so i have no way of leading to corruption again, i only have two file-level backups of the database with two different inconsistencies in the SELECTs. 

greetings!

--
regards,
Pawel Kudzia

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size
Следующее
От: Sergei Kornilov
Дата:
Сообщение: Re: BUG #17111: Database created, cannot be created, but reported as inexist