Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
Дата
Msg-id 1871889b-f5ef-f4d4-8d42-65a9e863e0a9@iki.fi
обсуждение исходный текст
Ответ на Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-bugs
On 17/07/2021 02:40, Peter Geoghegan wrote:
> On Thu, Jul 15, 2021 at 3:56 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I still think this is worth checking.  Despite the pending list wasn't
>> involved in the index scan with wrong results, the bug could be
>> related to insertion to the pending list.  Or it could be related to
>> moving entries from the pending list to the posting list/tree.
> 
> If I had to guess I'd say that the chances of the pending list being
> involved are high.
> 
> shiftList() deletes pages in the pending list -- this is called from
> ginInsertCleanup(). But shiftList() doesn't call GinPageSetDeleteXid()
> to set an XID that represents when the page is safe to recycle, which
> is what ginDeletePage() always does. Why is that okay?
> 
> Note that we usually use read/share buffer locks when lock-coupling
> inside ginInsertCleanup() -- so AFAICT we won't block-out concurrent
> readers with a link that's about to go stale due to recycling of the
> page. This looks unsafe. Of course it's very hard to tell what might
> be going on, since none of this seems to be explained anywhere.
> 
> Even ginDeletePage() didn't do the right thing with XIDs until bugfix
> commit 52ac6cd2d0. That commit didn't touch any pending list code --
> it should of at least explained why ginInsertCleanup()/shiftList()
> don't need to use the GinPageSetDeleteXid() stuff.

Hmm, seems we should fix that. But could a prematurely recycled deleted 
page cause permanent corruption?

Here's what I've found out so far, looking at the traces Pawel sent:

> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 0 (ginutil.c:636 ginGetStats)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 0 (ginget.c:1866 scanPendingInsert)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 383295 (ginget.c:1894 scanPendingInsert)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 383298 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 2 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 3 (ginget.c:1518 scanGetCandidate)
> [few hundred calls from scanGetCandidate omitted]
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 200586 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 200854 (ginget.c:1518 scanGetCandidate)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 1 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 1452 (ginbtree.c:89 ginFindLeafPage)
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: startScanKey called: excludeOnly 0 nentries 1
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
> 2021-07-16 07:01:19 UTC LOG:  GINBUG: called triConsistentFn(1): 1 0
> ...

So, it begins by reading the metadata page, and then it scans the 
pending list. There are a few hundred pages in the pending list, but no 
matches are found there.

Then, it reads the root page at blk 1. And then, it reads page 1452.

I used pg_filedump on the root page (blk 1):

> Block    1 ********************************************************
> <Header> -----
>  Block Offset: 0x00002000         Offsets: Lower    1376 (0x0560)
>  Block: Size 8192  Version    4            Upper    2776 (0x0ad8)
>  LSN:  logid    410 recoff 0x04c6b6c0      Special  8184 (0x1ff8)
>  Items:  338                      Free Space: 1400
>  Checksum: 0xde29  Prune XID: 0x00000000  Flags: 0x0000 ()
>  Length (including item array): 1376
> 
> <Data> -----
>  Item   1 -- Length:   16  Offset: 7736 (0x1e38)  Flags: NORMAL
>  Item   2 -- Length:   16  Offset: 8168 (0x1fe8)  Flags: NORMAL
>  Item   3 -- Length:   16  Offset: 7800 (0x1e78)  Flags: NORMAL
>  ...
>  Item 337 -- Length:   16  Offset: 7032 (0x1b78)  Flags: NORMAL
>  Item 338 -- Length:   16  Offset: 8152 (0x1fd8)  Flags: NORMAL
> 
> <Special Section> -----
>  GIN Index Section:
>   Flags: 0x00000000 ()  Maxoff: 0
>   Blocks: RightLink (-1)

It has no Flags set, which means that it is an internal entry tree page. 
The items on an internal entry tree page should point to other entry 
tree pages. Now let's take a look at the next page that it read, blk 1452:

> Block 1452 ********************************************************
> <Header> -----
>  Block Offset: 0x00b58000         Offsets: Lower     562 (0x0232)
>  Block: Size 8192  Version    4            Upper    8184 (0x1ff8)
>  LSN:  logid    471 recoff 0xf0cf8a68      Special  8184 (0x1ff8)
>  Items:  134                      Free Space: 7622
>  Checksum: 0x1628  Prune XID: 0x00000000  Flags: 0x0000 ()
>  Length (including item array): 560
> 
> <Data> -----
>  Item   1 -- Length:    0  Offset:    0 (0x0000)  Flags: UNUSED
>  Item   2 -- Length:    0  Offset:    0 (0x0000)  Flags: UNUSED
>  Item   3 -- Length: 7780  Offset:    0 (0x0000)  Flags: UNUSED
>  Item   4 -- Length: 5990  Offset:    4 (0x0004)  Flags: UNUSED
>  Item   5 -- Length:    0  Offset:    6 (0x0006)  Flags: UNUSED
>  Item   6 -- Length:    2  Offset: 15559 (0x3cc7)  Flags: UNUSED
>   Error: Item contents extend beyond block.
>          BlockSize<8192> Bytes Read<8192> Item Start<15561>.
>  Item   7 -- Length:    3  Offset: 15372 (0x3c0c)  Flags: REDIRECT
>  ...
>  Item 133 -- Length: 20913  Offset:    3 (0x0003)  Flags: UNUSED
>   Error: Item contents extend beyond block.
>          BlockSize<8192> Bytes Read<8192> Item Start<20916>.
>  Item 134 -- Length:    0  Offset:    0 (0x0000)  Flags: UNUSED
> 
> <Special Section> -----
>  GIN Index Section:
>   Flags: 0x00000001 (DATA)  Maxoff: 53
>   Blocks: RightLink (-1)

On this page, the DATA flag is set, so it is an internal *posting* tree 
page.

That's weird: the scan walked straight from an internal entry tree page 
(root, at blk 1) into an internal posting tree page (blk 1452). That 
doesn't make sense to me.

The next ReadBuffer call is this:

> 2021-07-16 07:01:19 UTC LOG:  ReadBuffer 1663/16390/16526 read gin blk 15559 (ginbtree.c:183 ginStepRight)

Where did block 15559 come from? How come we're stepping right to it? 
It's not the right sibling of the previously accessed page, 1452. In 
fact, 15559 is a leaf posting tree page. I don't understand how that 
sequence of page reads could happen.

After that, the log shows that it follows the right-links from block 
15559 onward. The incorrectly returned entries are on posting tree page 
255179. The scan reaches that page by following that chain of right-links.

I'm going to continue investigating this on Monday. I'll take a closer 
look at the index entries on those blocks to see if I can make sense of it.

- Heikki



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Japin Li
Дата:
Сообщение: Re: BUG #17111: Database created, cannot be created, but reported as inexist
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows