Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
От | Heikki Linnakangas |
---|---|
Тема | Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows |
Дата | |
Msg-id | 14dfcfab-36df-a92d-a42c-01065acf0036@iki.fi обсуждение исходный текст |
Ответ на | Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows (Pawel Kudzia <kudzia@gmail.com>) |
Ответы |
Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows
(Pawel Kudzia <kudzia@gmail.com>)
|
Список | pgsql-bugs |
On 18/07/2021 11:27, Pawel Kudzia wrote: > I'm sending those over directly to your mail. Thanks! I looked through those logs, and I'm now not seeing anything wrong with the pages involved. The query first scans the pending list, and doesn't find any matches there. It then descends the entry tree, finds pointer to a posting tree for id 1373, then scans the posting tree. For some reason, the TIDs of those offending rows, (4002784,1) and (4002869,14) are in the posting tree for key '1373', even though the heap tuples don't contain that key. Both of those TIDs are in the same TID list, on page 255179: > postgres=# select * from (select blknum, ((gin_leafpage_items(content)).*) from ginblocks where blknum=255179) x wheretids::text like '%(4002869,14)%'; > blknum | first_tid | nbytes | > tids > > > --------+--------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------- > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ------------------------------------------------------------------------------------------- > 255179 | (4002720,16) | 87 | {"(4002720,16)","(4002721,3)","(4002723,5)","(4002739,11)","(4002743,14)","(4002744,2)","(4002784,1)","(4002784,10)","(4002790,46)","(4002790,47)","(4002792 > ,22)","(4002792,23)","(4002792,24)","(4002796,22)","(4002798,2)","(4002800,7)","(4002802,27)","(4002802,28)","(4002803,22)","(4002805,28)","(4002816,13)","(4002817,19)","(4002820,2)","(40028 > 25,17)","(4002854,7)","(4002855,13)","(4002855,14)","(4002869,14)","(4006217,3)","(4006217,11)","(4006307,11)","(4006330,4)","(4006330,8)","(4006331,6)","(4006332,9)","(4006332,10)","(400633 > 3,9)","(4006335,2)","(4006338,1)","(4006342,7)","(4006343,5)","(4006407,2)","(4008912,2)"} > (1 row) I've used pg_filedump and pageinspect on all of the pages on the path to this entry, but I don't see anything wrong with the index. So how did it end up like that? I'm afraid I have no clue :-(. Peter's hunch that it's some bad interaction between deleted pages and the pending list sounds plausible, but looking at the code, I don't see any concrete bugs like that. We have one more lead that we haven't fully explored: amcheck reported this error: ERROR: index "entity_attribute_name_ids_gin": tid exceeds parent's high key in postingTree leaf on block 321746 Give that the GIN amheck patch is still very experimental, that could well be a bug in amcheck instead of real corruption. Also, that error was *not* in the posting tree for key '1373'. I hacked amcheck on my laptop to run only on the posting tree or '1373', and it didn't find any errors there. But let's check it out. I modified 'amcheck' to print more information. I also changed it so that it doesn't stop at the first error, but keeps going, and reports all issues as warnings. Pawel: Could you try the attached version of amcheck? Please apply the attached patch (you'll have to un-apply the previous version or start from a clean source tree), run "SELECT gin_index_parent_check('entity_attribute_name_ids_gin');" again, and send over the result? It should print a bunch of INFOs and WARNINGs to the client's stderr, so a good way to capture that is something like this: psql -c "select gin_index_parent_check('entity_attribute_name_ids_gin');" 2> amcheck-log Let's see if that reveals something interesting. If not, what can we do next? 1. Let's fix the issues with the triconsistent functions that we discussed earlier. They don't explain what we're seeing here, but should be cleaned up nevertheless. 2. Let's add more sanity checks wherever we read a GIN page, to check that it's of the expected kind. Again we have no theory on what sequence of events could cause this, but having more sanity checks seems like a good idea in any case. 3. Run the test case again and wait for the issue to re-appear. This time, let's try to capture the WAL, so that we can trace back the modifications to the pages that lead to the situation. Any other ideas? - Heikki
Вложения
В списке pgsql-bugs по дате отправления:
Предыдущее
От: Kyotaro HoriguchiДата:
Сообщение: Re: BUG #17103: WAL segments are not removed after exceeding max_slot_wal_keep_size
Следующее
От: Pawel KudziaДата:
Сообщение: Re: IRe: BUG #16792: silent corruption of GIN index resulting in SELECTs returning non-matching rows