[PATCH] Btree BackwardScan race condition on Standby during VACUUM

Поиск
Список
Период
Сортировка
От Michail Nikolaev
Тема [PATCH] Btree BackwardScan race condition on Standby during VACUUM
Дата
Msg-id CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
обсуждение исходный текст
Ответы Re: [PATCH] Btree BackwardScan race condition on Standby duringVACUUM  ("Andrey M. Borodin" <x4mmm@yandex-team.ru>)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM  (Peter Geoghegan <pg@bowt.ie>)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM  (Peter Geoghegan <pg@bowt.ie>)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-hackers
Hello, hackers.

------ ABSTRACT ------
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
A lot of versions are affected including 12 and new-coming 13.
Happens only on standby. Seems like could not cause invalid query results.

------ REMARK ------
While working on support for index hint bits on standby [1] I have
started to getting
"ERROR:  could not find left sibling of block XXXX in index XXXX"
during stress tests.

I was sure I have broken something in btree and spent a lot of time
trying to figure what.
And later...  I realized what it is bug in btree since a very old times...
Because of much faster scans with LP_DEAD support on a standby it
happens much more frequently in my case.

------ HOW TO REPRODUCE  ------
It is not easy to reproduce the issue but you can try (tested on
REL_12_STABLE and master):

1) Setup master (sync replica and 'remote_apply' are not required -
just make scripts simpler):
autovacuum = off
synchronous_standby_names = '*'
synchronous_commit = 'remote_apply'

2) Setup standby:
primary_conninfo = 'user=postgres host=127.0.0.1 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
port = 6543

3) Prepare pgbench file with content (test.bench):
BEGIN;
select * from pgbench_accounts order by aid desc limit 1;
END;

4) Prepare the index:
./pgbench -i -s 10 -U postgres
./psql -U postgres -c "delete from pgbench_accounts where aid IN
(select aid from pgbench_accounts order by aid desc limit 500000)"

5) Start index scans on the standby:
./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 10000 -U
postgres -p 6543

6) Run vacuum on the master:
./psql -U postgres -c "vacuum pgbench_accounts"

7) You should see something like this:
> progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902
> .....
> progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338
> client 3 script 0 aborted in command 1 query 0: ERROR:  could not find left sibling of block 1451 in index
"pgbench_accounts_pkey"
> progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709
> .....
> progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238
> client 2 script 0 aborted in command 1 query 0: ERROR:  could not find left sibling of block 2104 in index
"pgbench_accounts_pkey"
> progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417
> .....
> progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251
> client 4 script 0 aborted in command 1 query 0: ERROR:  could not find left sibling of block 2501 in index
"pgbench_accounts_pkey"

I was able to reproduce issue with vanilla PG_12 on different systems
including the Windows machine.
On some servers it happens 100% times. On others - very seldom.

It is possible to radically increase chance to reproduce the issue by
adding a sleep in btree_xlog_unlink_page[7]:
>   + pg_usleep(10 * 1000L);
>
>   /* Rewrite target page as empty deleted page */
>   buffer = XLogInitBufferForRedo(record, 0);

------  WHAT HAPPENS  ------
It is race condition issue between btree_xlog_unlink_page and _bt_walk_left.

btree_xlog_unlink_page removes page from btree changing btpo_prev and
btpo_next of its left and right siblings to point
to the each other and marks target page as removed. All these
operations are done using one-page-at-a-time locking because of[4]:

>    * In normal operation, we would lock all the pages this WAL record
>    * touches before changing any of them.  In WAL replay, it should be okay
>    * to lock just one page at a time, since no concurrent index updates can
>    * be happening, and readers should not care whether they arrive at the
>    * target page or not (since it's surely empty).

_bt_walk_left walks left in very tricky way. Please refer to
src/backend/access/nbtree/README for details[5]:

>    Moving left in a backward scan is complicated because we must consider
>    the possibility that the left sibling was just split (meaning we must find
>    the rightmost page derived from the left sibling), plus the possibility
>    that the page we were just on has now been deleted and hence isn't in the
>    sibling chain at all anymore.

So, this is how race is happens:

0) this is target page (B) and its siblings.
A <---> B <---> C ---> END

1) walreceiver starts btree_xlog_unlink_page for the B. It is changes
the links from C to A and from A to C (I hope my scheme will be
displayed correctly):
A <---- B ----> C ---> END
^               ^
 \_____________/

But B is not marked as BTP_DELETED yet - walreceiver stops at nbtxlog:697[2].

2) other backend starts _bt_walk_left from B.
It checks A, goes to from A to C by updated btpo_next and later sees
end of the btree.
So, next step is to check if B was deleted (nbtsearch:2011)[3] and try
to recover.

But B is not yet deleted! It will be marked as BTP_DELETED after a few
millis by walreceiver but not yet.
So, nbtsearch:2046[6] is happens.


------  HOW TO FIX  ------
The first idea was to mark page as BTP_DELETED before updating siblings links.

Second - to update pages in the following order:
* change btpo_next
* mark as BTP_DELETED
* change btpo_prev

Such a changes fix the exactly described above race condition... but
cause a more tricky ones to start happening.
And I think it is better to avoid any too complex unclear solutions here..

Another idea - to sleep a little waiting walreceiver to mark the page
as deleted. But it seems to feel too ugly. Also it is unclear how long
to wait.

So, I think right way is to lock all three pages as it is done on the
primary. As far as I can see it is not causes any real performance
regression.

Patch is attached (on top of REL_12_STABLE).

Thanks,
Michail.

[1]: https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L697
[3]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2011
[4]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L575-L581
[5]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/README#L289-L314
[6]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2046
[7]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L696

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Julien Rouhaud
Дата:
Сообщение: Re: Collation versioning
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: error context for vacuum to include block number