Обсуждение: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

Поиск
Список
Период
Сортировка

Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
chenhj
Дата:
Hi,all

In our PostgreSQL 10.7(rhel 6.3) database, autovacuum process and many insert processes blocked in gin index's LWLock:buffer_content for long time. 

In other words, the following gin index lwlock deadlock phenomenon has occurred again. Since the following bug in 10.7 has been fixed. So this should be a new bug.

https://www.postgresql.org/message-id/flat/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com

We have already obtained coredump files of autovacuum process and one of insert processes.
Unfortunately the insert process(run by gcore) held no lwlock, it should be another process(we did not fetch core file) that hold the lwlock needed for autovacuum process.

the stack is as following:

## stack of one insert process: Acquire lock 0x7f6c517dbfa4 which was held by vacuum process
----------------------------------------------------------------------------
(gdb) bt
#0  0x000000369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000000006a7910 in PGSemaphoreLock (sema=0x7f6c4f76a7b8) at pg_sema.c:316
#2  0x0000000000718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) at lwlock.c:1233
#3  0x000000000048b622 in ginTraverseLock (buffer=224225, searchMode=0 '\000') at ginbtree.c:40
#4  0x000000000048ca13 in ginFindLeafPage (btree=0x7fffc71c4ea0, searchMode=0 '\000', snapshot=0x0) at ginbtree.c:97
#5  0x00000000004894db in ginInsertItemPointers (index=<value optimized out>, rootBlkno=<value optimized out>, items=<value optimized out>, nitem=<value optimized out>, buildStats=0x0)
    at gindatapage.c:1909
#6  0x00000000004863a7 in ginEntryInsert (ginstate=0x1c72158, attnum=1, key=20190913, category=0 '\000', items=0x1c81508, nitem=72, buildStats=0x0) at gininsert.c:214
#7  0x000000000049219a in ginInsertCleanup (ginstate=0x1c72158, full_clean=0 '\000', fill_fsm=1 '\001', forceCleanup=<value optimized out>, stats=<value optimized out>) at ginfast.c:878
#8  0x000000000049308e in ginHeapTupleFastInsert (ginstate=0x1c72158, collector=<value optimized out>) at ginfast.c:443
#9  0x0000000000486749 in gininsert (index=<value optimized out>, values=0x7fffc71c54f0, isnull=0x7fffc71c5600 "", ht_ctid=0x1c6d3a4, heapRel=<value optimized out>, 
    checkUnique=<value optimized out>, indexInfo=0x1c61da8) at gininsert.c:522
#10 0x00000000005f75f0 in ExecInsertIndexTuples (slot=0x1c62168, tupleid=0x1c6d3a4, estate=0x1c61768, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0) at execIndexing.c:387
#11 0x0000000000616497 in ExecInsert (pstate=0x1c61ab8) at nodeModifyTable.c:519
#12 ExecModifyTable (pstate=0x1c61ab8) at nodeModifyTable.c:1779
#13 0x00000000005fb6bf in ExecProcNode (queryDesc=0x1c67760, direction=<value optimized out>, count=0, execute_once=-72 '\270') at ../../../src/include/executor/executor.h:250
#14 ExecutePlan (queryDesc=0x1c67760, direction=<value optimized out>, count=0, execute_once=-72 '\270') at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1c67760, direction=<value optimized out>, count=0, execute_once=-72 '\270') at execMain.c:364
#16 0x00007f6e226aa6f8 in pgss_ExecutorRun (queryDesc=0x1c67760, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at pg_stat_statements.c:889
#17 0x00007f6e224a474d in explain_ExecutorRun (queryDesc=0x1c67760, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at auto_explain.c:267
#18 0x000000000072a15b in ProcessQuery (plan=<value optimized out>, 
    sourceText=0x1c21458 "INSERT INTO bi_dm.tdm_wh_shopgds_fnsh_rt (STATIS_DATE,SITE_CD,LGORT,ZSIZE,ZVTWEG,VSBED,TOTAL_CNT,FNSH_CNT,UNFNSH_CNT,ETL_TIME,DEPT_CD,TMALL_FLG,BUSS_TP,ZCKYWLX) VALUES($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$"..., params=0x1c21580, queryEnv=0x0, dest=<value optimized out>, completionTag=0x7fffc71c5de0 "") at pquery.c:161
#19 0x000000000072a395 in PortalRunMulti (portal=0x1c57f18, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xc9b480, altdest=0xc9b480, completionTag=0x7fffc71c5de0 "") at pquery.c:1286
#20 0x000000000072aa98 in PortalRun (portal=0x1c57f18, count=1, isTopLevel=1 '\001', run_once=1 '\001', dest=0x1c25768, altdest=0x1c25768, completionTag=0x7fffc71c5de0 "") at pquery.c:799
#21 0x0000000000728c9a in exec_execute_message (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1bbb800 "lbiwhdb", username=<value optimized out>) at postgres.c:2007
#22 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1bbb800 "lbiwhdb", username=<value optimized out>) at postgres.c:4180
#23 0x00000000006bb43a in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4405
---Type <return> to continue, or q <return> to quit---
#24 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4077
#25 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1755
#26 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1363
#27 0x000000000063b4d0 in main (argc=3, argv=0x1b839e0) at main.c:228
(gdb) f 2
#2  0x0000000000718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) at lwlock.c:1233
1233 lwlock.c: No such file or directory.
in lwlock.c
(gdb) p num_held_lwlocks
$1 = 0
(gdb) 


## stack of autovacuum:Acquire lock 0x7f6c519ba5a4 and hold 0x7f6c517dbfa4, 0x7f6c51684f64
--------------------------------------
(gdb) bt
#0  0x000000369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000000006a7910 in PGSemaphoreLock (sema=0x7f6c4f77fdb8) at pg_sema.c:316
#2  0x0000000000718225 in LWLockAcquire (lock=0x7f6c519ba5a4, mode=LW_EXCLUSIVE) at lwlock.c:1233
#3  0x00000000004900fb in ginDeletePage (gvs=0x7fffc71c26d0, blkno=3709, isRoot=0 '\000', parent=<value optimized out>, myoff=2) at ginvacuum.c:154
#4  ginScanToDelete (gvs=0x7fffc71c26d0, blkno=3709, isRoot=0 '\000', parent=<value optimized out>, myoff=2) at ginvacuum.c:297
#5  0x000000000049004c in ginScanToDelete (gvs=0x7fffc71c26d0, blkno=6418, isRoot=1 '\001', parent=<value optimized out>, myoff=0) at ginvacuum.c:281
#6  0x0000000000490d1f in ginVacuumPostingTree (info=0x7fffc71c4da0, stats=<value optimized out>, callback=<value optimized out>, callback_state=<value optimized out>) at ginvacuum.c:408
#7  ginbulkdelete (info=0x7fffc71c4da0, stats=<value optimized out>, callback=<value optimized out>, callback_state=<value optimized out>) at ginvacuum.c:643
#8  0x00000000005e9d05 in lazy_vacuum_index (indrel=0x7f6c4f396fe8, stats=0x1c25ad8, vacrelstats=0x1c25578) at vacuumlazy.c:1621
#9  0x00000000005eaa29 in lazy_scan_heap (onerel=<value optimized out>, options=<value optimized out>, params=0x1ca7550, bstrategy=<value optimized out>) at vacuumlazy.c:1311
#10 lazy_vacuum_rel (onerel=<value optimized out>, options=<value optimized out>, params=0x1ca7550, bstrategy=<value optimized out>) at vacuumlazy.c:258
#11 0x00000000005e8d68 in vacuum_rel (relid=176178, relation=<value optimized out>, options=97, params=0x1ca7550) at vacuum.c:1445
#12 0x00000000005e9187 in vacuum (options=97, relation=0x7fffc71c5640, relid=<value optimized out>, params=0x1ca7550, va_cols=0x0, bstrategy=<value optimized out>, isTopLevel=1 '\001')
    at vacuum.c:306
#13 0x00000000006ab262 in autovacuum_do_vac_analyze () at autovacuum.c:3135
#14 do_autovacuum () at autovacuum.c:2490
#15 0x00000000006aba70 in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1707
#16 0x00000000006abb46 in StartAutoVacWorker () at autovacuum.c:1504
#17 0x00000000006b972a in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:5462
#18 sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:5159
#19 <signal handler called>
#20 0x000000369e2e1393 in __select_nocancel () from /lib64/libc.so.6
#21 0x00000000006baa66 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1719
#22 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1363
#23 0x000000000063b4d0 in main (argc=3, argv=0x1b839e0) at main.c:228
(gdb) f 2
#2  0x0000000000718225 in LWLockAcquire (lock=0x7f6c519ba5a4, mode=LW_EXCLUSIVE) at lwlock.c:1233
1233 lwlock.c: No such file or directory.
in lwlock.c
(gdb) p num_held_lwlocks
$1 = 2
(gdb) p InterruptHoldoffCount
$2 = 3
(gdb) p held_lwlocks
$3 = {{lock = 0x7f6c517dbfa4, mode = LW_EXCLUSIVE}, {lock = 0x7f6c51684f64, mode = LW_EXCLUSIVE}, {lock = 0x7f6c4f782a00, mode = LW_SHARED}, {lock = 0x7f6c4f78df80, mode = LW_EXCLUSIVE}, {
    lock = 0x7f6c51a7a1e4, mode = LW_EXCLUSIVE}, {lock = 0x7f6c4f78df80, mode = LW_EXCLUSIVE}, {lock = 0x7f6c53a7eb64, mode = LW_EXCLUSIVE}, {lock = 0x7f6c4f78df80, mode = LW_EXCLUSIVE}, {
    lock = 0x0, mode = LW_EXCLUSIVE} <repeats 192 times>}
(gdb) 
(gdb) f 4
#4  ginScanToDelete (gvs=0x7fffc71c26d0, blkno=3709, isRoot=0 '\000', parent=<value optimized out>, myoff=2) at ginvacuum.c:297
297 in ginvacuum.c
(gdb) p *me
$15 = {child = 0x0, parent = 0x7fffc71c26a0, blkno = 0, leftBlkno = 5649, isRoot = 0 '\000'}
(gdb) p *(&BufferDescriptors[(buffer - 1)].bufferdesc)
$3 = {tag = {rnode = {spcNode = 1663, dbNode = 16397, relNode = 692113}, forkNum = MAIN_FORKNUM, blockNum = 3709}, buf_id = 202271, state = {value = 3549691907}, wait_backend_pid = 0, 
  freeNext = -2, content_lock = {tranche = 54, state = {value = 1627389952}, waiters = {head = 539, tail = 1223}}}
(gdb) p &((&BufferDescriptors[(buffer - 1)].bufferdesc)->content_lock)
$5 = (LWLock *) 0x7f6c51684f64

(gdb) f 5
#5  0x000000000049004c in ginScanToDelete (gvs=0x7fffc71c26d0, blkno=6418, isRoot=1 '\001', parent=<value optimized out>, myoff=0) at ginvacuum.c:281
281 in ginvacuum.c
(gdb) p *(&BufferDescriptors[(buffer - 1)].bufferdesc)
$6 = {tag = {rnode = {spcNode = 1663, dbNode = 16397, relNode = 692113}, forkNum = MAIN_FORKNUM, blockNum = 6418}, buf_id = 224224, state = {value = 3549692473}, wait_backend_pid = 14261, 
  freeNext = -2, content_lock = {tranche = 54, state = {value = 1627389952}, waiters = {head = 527, tail = 370}}}
(gdb) p &((&BufferDescriptors[(buffer - 1)].bufferdesc)->content_lock)
$7 = (LWLock *) 0x7f6c517dbfa4


According to above information in core file and source code, autovacuum process is trying to lock the block 5649,and be blocked.

                                          6418(1. held LWLock:0x7f6c517dbfa4) root
                                            |
5649(3. Acquire LWLock:0x7f6c519ba5a4)     -->  3709(2.held LWLock:0x7f6c51684f64)

Does the locking order of autovacuum process(root->right->left) correct? While insert process lock gin buffer by order of bottom->top and left->right.

1. vacuum(root->right->left):
---------------------------------------------------------------------
static void
ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
{
if (ginVacuumPostingTreeLeaves(gvs, rootBlkno))
{
...
LockBufferForCleanup(buffer);//1. (1*)lock the root
...
ginScanToDelete(gvs, rootBlkno, true, &root, InvalidOffsetNumber);//2. (2)scan to delete for root
}
}

static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
{
...
if (!isRoot)
LockBuffer(buffer, GIN_EXCLUSIVE);//2.1.1 (4)lock the first child(left); 
                                  //2.2.1 (7*)lock the second child(right); 
...

if (!GinPageIsLeaf(page))
{
...
for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
{
PostingItem *pitem = GinDataPageGetPostingItem(page, i);

if (ginScanToDelete(gvs, PostingItemGetBlockNumber(pitem), FALSE, me, i))// 2.1 (3)scan to delete for the first child(left);
                                                                         // 2.2 (6)scan to delete for the second child(right);
i--;
}
}
...
if (isempty)
{
/* we never delete the left- or rightmost branch */
if (me->leftBlkno != InvalidBlockNumber && !GinPageRightMost(page))
{
Assert(!isRoot);
ginDeletePage(gvs, blkno, me->leftBlkno, me->parent->blkno, myoff, me->parent->isRoot);//2.2.2 (8)delete the second child(right)
meDelete = TRUE;
}
}

if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);//2.1.2 (5)unlock the first child(left)
...
}

static void
ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkno,
  BlockNumber parentBlkno, OffsetNumber myoff, bool isParentRoot)
{
...
lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
RBM_NORMAL, gvs->strategy);
...
LockBuffer(lBuffer, GIN_EXCLUSIVE);// 2.2.2.1 (9*)lock the first child(left); 
...
}
---------------------------------------------------------------------

2. insert(bottom->top,left->right):

https://www.postgresql.org/message-id/19e4290b.3bb.166ea2c6831.Coremail.chjischj%40163.com
> # insert process(ginInsertValue())
>                      644(root blkno)
>                       |
>      7054(2. held LWLock:0x2aaac587ae64)     ----rightlink---->    xxxx(3. Acquire LWLock:0x2aaab4009564,buffer = 2119038,blkno should be 9954)
>        |
> 701(1. held LWLock:0x2aaab670dfe4)
> The ginInsertValue() function above gets the lwlock in the order described in the README.
> src/backend/access/gin/README
> ---------------------------------------------------------------
> To avoid deadlocks, B-tree pages must always be locked in the same order:
> left to right, and bottom to top.
> ...
> -----------------------------------------------------------------


Regards,
Chen Huajun

Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
Hi!

Thank you for reporting.

On Sun, Sep 29, 2019 at 11:17 AM chenhj <chjischj@163.com> wrote:
> Does the locking order of autovacuum process(root->right->left) correct? While insert process lock gin buffer by
orderof bottom->top and left->right.
 
>
> 1. vacuum(root->right->left):

Starting from root seems OK for me, because vacuum blocks all
concurrent inserts before doing this.  But this needs to be properly
documented in readme.

Locking from right to left is clearly wrong.  It could deadlock with
concurrent ginStepRight(), which locks from left to right.  I expect
this happened in your case.  I'm going to reproduce this and fix.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Sun, Sep 29, 2019 at 5:38 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Sun, Sep 29, 2019 at 11:17 AM chenhj <chjischj@163.com> wrote:
> > Does the locking order of autovacuum process(root->right->left) correct? While insert process lock gin buffer by
orderof bottom->top and left->right.
 
> >
> > 1. vacuum(root->right->left):
>
> Starting from root seems OK for me, because vacuum blocks all
> concurrent inserts before doing this.  But this needs to be properly
> documented in readme.
>
> Locking from right to left is clearly wrong.  It could deadlock with
> concurrent ginStepRight(), which locks from left to right.  I expect
> this happened in your case.  I'm going to reproduce this and fix.

I just managed to reproduce this using two sessions on master branch.

session 1
    session 2

# create table test with (autovacuum_enabled = false) as (select
array[1] ar from generate_series(1,20000) i);
# create index test_ar_idx on test using gin (ar);
# vacuum analyze test;
# delete from test;

    # set enable_seqscan = off;
    gdb> b ginbtree.c:150
    # select * from test where ar @> '{1}'::integer[];
    Step in gdb just before ReadBuffer() in ReleaseAndReadBuffer().

gdb> b ginvacuum.c:155
# vacuum test;

    gdb > continue
gdb> continue

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Sun, Sep 29, 2019 at 6:12 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Sun, Sep 29, 2019 at 5:38 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > On Sun, Sep 29, 2019 at 11:17 AM chenhj <chjischj@163.com> wrote:
> > > Does the locking order of autovacuum process(root->right->left) correct? While insert process lock gin buffer by
orderof bottom->top and left->right.
 
> > >
> > > 1. vacuum(root->right->left):
> >
> > Starting from root seems OK for me, because vacuum blocks all
> > concurrent inserts before doing this.  But this needs to be properly
> > documented in readme.
> >
> > Locking from right to left is clearly wrong.  It could deadlock with
> > concurrent ginStepRight(), which locks from left to right.  I expect
> > this happened in your case.  I'm going to reproduce this and fix.
>
> I just managed to reproduce this using two sessions on master branch.
>
> session 1
>     session 2
>
> # create table test with (autovacuum_enabled = false) as (select
> array[1] ar from generate_series(1,20000) i);
> # create index test_ar_idx on test using gin (ar);
> # vacuum analyze test;
> # delete from test;
>
>     # set enable_seqscan = off;
>     gdb> b ginbtree.c:150
>     # select * from test where ar @> '{1}'::integer[];
>     Step in gdb just before ReadBuffer() in ReleaseAndReadBuffer().
>
> gdb> b ginvacuum.c:155
> # vacuum test;
>
>     gdb > continue
> gdb> continue

Patch with fix is attached.  Idea is simple: ginScanToDelete() now
keeps exclusive lock on left page eliminating the need to relock it.
So, we preserve left-to-right locking order and can't deadlock with
ginStepRight().

Also, we need to adjust Concurrency section in GIN README.  For me the
description looks vague and inconsistent even with current behavior.
I'm going to post this later.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Peter Geoghegan
Дата:
On Sun, Sep 29, 2019 at 7:38 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Starting from root seems OK for me, because vacuum blocks all
> concurrent inserts before doing this.  But this needs to be properly
> documented in readme.

I never got an adequate answer to this closely related question almost
two years ago:

https://www.postgresql.org/message-id/CAH2-Wz=GTnAPzEEZqYELOv3h1Fxpo5xhMrBP6aMGEKLKv95csQ@mail.gmail.com

In general, ginInsertCleanup() seems badly designed. Why is it okay
that there is no nbtree-style distinction between page deletion and
page recycling?

> Locking from right to left is clearly wrong.  It could deadlock with
> concurrent ginStepRight(), which locks from left to right.  I expect
> this happened in your case.  I'm going to reproduce this and fix.

I am sick and tired of seeing extremely basic errors like this within
GIN's locking protocols. Bugs happen, but these are not ordinary bugs.
They're more or less all a result of giving no thought to the high
level design. I'm not blaming you for this, or any one person. But
this is not okay.

Anything around index concurrency needs to be explained in
excruciating detail, while taking a top-down approach that applies
general rules (e.g. you can only do lock coupling left to right, or
bottom to top in nbtree). Anything less than that should be assumed to
be wrong on general principle.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Andrey Borodin
Дата:

> 29 сент. 2019 г., в 21:27, Alexander Korotkov <a.korotkov@postgrespro.ru> написал(а):
>
> Patch with fix is attached.  Idea is simple: ginScanToDelete() now
> keeps exclusive lock on left page eliminating the need to relock it.
> So, we preserve left-to-right locking order and can't deadlock with
> ginStepRight().

In this function ginDeletePage(gvs, blkno, BufferGetBlockNumber(me->leftBuffer),...)
we are going to reread buffer
lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
                                 RBM_NORMAL, gvs->strategy);
Is it OK?


> 30 сент. 2019 г., в 0:52, Peter Geoghegan <pg@bowt.ie> написал(а):
>
> Why is it okay
> that there is no nbtree-style distinction between page deletion and
> page recycling?
As far as I understand deleted page is stamped with
GinPageSetDeleteXid(page, ReadNewTransactionId());
It will not be recycled until that Xid is far behind.
BTW we found a small bug (wraparound) in similar GiST and B-tree implementations.
Probably, it's there in GIN too.

--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud




Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Mon, Sep 30, 2019 at 8:38 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > 29 сент. 2019 г., в 21:27, Alexander Korotkov <a.korotkov@postgrespro.ru> написал(а):
> >
> > Patch with fix is attached.  Idea is simple: ginScanToDelete() now
> > keeps exclusive lock on left page eliminating the need to relock it.
> > So, we preserve left-to-right locking order and can't deadlock with
> > ginStepRight().
>
> In this function ginDeletePage(gvs, blkno, BufferGetBlockNumber(me->leftBuffer),...)
> we are going to reread buffer
> lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
>                                  RBM_NORMAL, gvs->strategy);
> Is it OK?


That's related not only to left buffer.  We also could pass buffer
instead of deleteBlkno.  And with some code changes it's also possible
to pass buffer instead of parentBlkno.  But I decided to keep code
changes minimal at least for backpatch version.  We could apply such
optimization to master as a separate patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Sun, Sep 29, 2019 at 10:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Sep 29, 2019 at 7:38 AM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Starting from root seems OK for me, because vacuum blocks all
> > concurrent inserts before doing this.  But this needs to be properly
> > documented in readme.
>
> I never got an adequate answer to this closely related question almost
> two years ago:
>
> https://www.postgresql.org/message-id/CAH2-Wz=GTnAPzEEZqYELOv3h1Fxpo5xhMrBP6aMGEKLKv95csQ@mail.gmail.com
>
> In general, ginInsertCleanup() seems badly designed. Why is it okay
> that there is no nbtree-style distinction between page deletion and
> page recycling?

Thank you for pointing.  I remember this thread, but don't remember
details.  I'll reread it.

> > Locking from right to left is clearly wrong.  It could deadlock with
> > concurrent ginStepRight(), which locks from left to right.  I expect
> > this happened in your case.  I'm going to reproduce this and fix.
>
> I am sick and tired of seeing extremely basic errors like this within
> GIN's locking protocols. Bugs happen, but these are not ordinary bugs.
> They're more or less all a result of giving no thought to the high
> level design. I'm not blaming you for this, or any one person. But
> this is not okay.
>
> Anything around index concurrency needs to be explained in
> excruciating detail, while taking a top-down approach that applies
> general rules (e.g. you can only do lock coupling left to right, or
> bottom to top in nbtree). Anything less than that should be assumed to
> be wrong on general principle.

Frankly speaking I'm not very happy with special version of B-tree,
which is builtin to GIN.  This version of B-tree is lacking of high
keys.  AFAIR because of lack of high keys, we can't implement the same
concurrency model as nbtree.  Instead we have to do super-locking for
page deletion and so on.  That looks ridiculous for me.  I think in
future we should somehow reimplement GIN on top of nbtree.

But we have to provide some way less radical fixes for our stable
releases.  In particular, I believe patch I've posted in this thread
makes situation better not worse.  That is it fixes one bug without
introducing mode bugs.  But I'm going to analyze more on this and
document GIN concurrency better in the README.  Probably, I'll spot
more details.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Peter Geoghegan
Дата:
On Sun, Sep 29, 2019 at 10:38 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> As far as I understand deleted page is stamped with
> GinPageSetDeleteXid(page, ReadNewTransactionId());
> It will not be recycled until that Xid is far behind.

That only gets used within posting tree pages, though.
ginInsertCleanup() is concerned with pending list pages.

> BTW we found a small bug (wraparound) in similar GiST and B-tree implementations.
> Probably, it's there in GIN too.

Probably, but that's much less of a problem to me.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Peter Geoghegan
Дата:
On Mon, Sep 30, 2019 at 11:00 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for pointing.  I remember this thread, but don't remember
> details.  I'll reread it.

I think that doing this now would be a good idea:

"""
Defensively checking the page type (pending, posting, etc) within
scanPendingInsert() would be a good start. That's already something
that we do for posting trees. If we're following the same rules as
posting trees (which sounds like a good idea), then we should have the
same defenses. Same applies to using elogs within ginInsertCleanup()
-- we should promote those Assert()s to elog()s.
"""

In other words, ginInsertCleanup() should have defensive "Page types
match?" checks that are similar to the existing checks in
ginStepRight(). In both cases we're stepping right while coupling
locks, to avoid concurrent page deletion.

> > > Locking from right to left is clearly wrong.  It could deadlock with
> > > concurrent ginStepRight(), which locks from left to right.  I expect
> > > this happened in your case.  I'm going to reproduce this and fix.

> Frankly speaking I'm not very happy with special version of B-tree,
> which is builtin to GIN.  This version of B-tree is lacking of high
> keys.  AFAIR because of lack of high keys, we can't implement the same
> concurrency model as nbtree.  Instead we have to do super-locking for
> page deletion and so on.  That looks ridiculous for me.  I think in
> future we should somehow reimplement GIN on top of nbtree.

I think that that could work on top of the new nbtree posting list
stuff, provided that it was acceptable to not use posting list
compression in the main tree -- the way that posting list splits work
there needs to be able to think about free space in a very simple way
that is broken even by GIN's varbyte compression. Compression could
still be used in posting trees, though.

> But we have to provide some way less radical fixes for our stable
> releases.  In particular, I believe patch I've posted in this thread
> makes situation better not worse.  That is it fixes one bug without
> introducing mode bugs.  But I'm going to analyze more on this and
> document GIN concurrency better in the README.  Probably, I'll spot
> more details.

Thanks.
-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Peter Geoghegan
Дата:
On Sun, Sep 29, 2019 at 8:12 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I just managed to reproduce this using two sessions on master branch.
>
> session 1
>     session 2

Was the involvement of the pending list stuff in Chen's example just a
coincidence? Can you recreate the problem while eliminating that
factor (i.e. while setting fastupdate to off)?

Chen's example involved an INSERT that deadlocked against VACUUM --
not a SELECT. Is this just a coincidence?

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Mon, Sep 30, 2019 at 10:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sun, Sep 29, 2019 at 8:12 AM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > I just managed to reproduce this using two sessions on master branch.
> >
> > session 1
> >     session 2
>
> Was the involvement of the pending list stuff in Chen's example just a
> coincidence? Can you recreate the problem while eliminating that
> factor (i.e. while setting fastupdate to off)?
>
> Chen's example involved an INSERT that deadlocked against VACUUM --
> not a SELECT. Is this just a coincidence?

Chen wrote.

> Unfortunately the insert process(run by gcore) held no lwlock, it should be another process(we did not fetch core
file)that hold the lwlock needed for autovacuum process.
 

So, he catched backtrace for INSERT and post it for information.  But
since INSERT has no lwlocks held, it couldn't participate deadlock.
It was just side waiter.

I've rerun my reproduction case and it still deadlocks.  Just the same
steps but GIN index with (fastupdate = off).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Tue, Oct 1, 2019 at 5:55 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Mon, Sep 30, 2019 at 10:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Sun, Sep 29, 2019 at 8:12 AM Alexander Korotkov
> > <a.korotkov@postgrespro.ru> wrote:
> > > I just managed to reproduce this using two sessions on master branch.
> > >
> > > session 1
> > >     session 2
> >
> > Was the involvement of the pending list stuff in Chen's example just a
> > coincidence? Can you recreate the problem while eliminating that
> > factor (i.e. while setting fastupdate to off)?
> >
> > Chen's example involved an INSERT that deadlocked against VACUUM --
> > not a SELECT. Is this just a coincidence?
>
> Chen wrote.
>
> > Unfortunately the insert process(run by gcore) held no lwlock, it should be another process(we did not fetch core
file)that hold the lwlock needed for autovacuum process.
 
>
> So, he catched backtrace for INSERT and post it for information.  But
> since INSERT has no lwlocks held, it couldn't participate deadlock.
> It was just side waiter.
>
> I've rerun my reproduction case and it still deadlocks.  Just the same
> steps but GIN index with (fastupdate = off).

BTW, while trying to revise README I found another bug.  It appears to
be possible to reach deleted page from downlink.  The reproduction
case is following.

session 1
    session 2

# create table tmp (ar int[]) with (autovacuum_enabled = false);
# insert into tmp (select '{1}' from generate_series(1,10000000) i);
# insert into tmp values ('{1,2}');
# insert into tmp (select '{1}' from generate_series(1,10000000) i);
# create index tmp_idx on tmp using gin(ar);

    # delete from tmp;

# set max_parallel_workers_per_gather = 0;
/* Breakpoint where entyLoadMoreItems() calls ginFindLeafPage() to
search GIN posting tree  */
gdb> b ginget.c:682
gdb> select * from tmp where ar @> '{1,2}';
gdb> /* step till ReleaseAndReadBuffer() releases a buffer */

    # vacuum tmp;

# continue

It also appears that previous version of deadlock fix didn't supply
left sibling to leftmost child of any page.  As result, internal pages
were never deleted.  The first attached patch is revised fix is
attached.

The second patch fix traversing to deleted page using downlink.
Similarly to nbtree, we just always move right if landed on deleted
page.  Also, it appears that we clear all other flags while marking
page as deleted.  That cause assert to fire.  With patch, we just add
deleted flag without erasing others.  Also, I have to remove assert
that ginStepRight() never steps to deleted page.  If we landed to
deleted page from downlink, then we can find other deleted page by
rightlink.

I'm planning to continue work on README, comments and commit messages.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
Hi, Peter!

On Fri, Oct 4, 2019 at 12:05 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I'm planning to continue work on README, comments and commit messages.

It tool me a lot of efforts to revise a concurrency section in README.
I can't judge the result, but I probably did my best.  I'd like to
commit this patchset including both bug fixes and README update. But
I'd like you to take a look on the README patch first.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Peter Geoghegan
Дата:
Hi Alexander,

On Sun, Oct 27, 2019 at 7:04 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> It tool me a lot of efforts to revise a concurrency section in README.
> I can't judge the result, but I probably did my best.  I'd like to
> commit this patchset including both bug fixes and README update. But
> I'd like you to take a look on the README patch first.

Thank you for working on this.

I am flying back to the USA today, and will try to take a look at what
you came up with on the way. I will definitely have some feedback in
the next few days.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Mon, Oct 28, 2019 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Oct 27, 2019 at 7:04 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > It tool me a lot of efforts to revise a concurrency section in README.
> > I can't judge the result, but I probably did my best.  I'd like to
> > commit this patchset including both bug fixes and README update. But
> > I'd like you to take a look on the README patch first.
>
> Thank you for working on this.
>
> I am flying back to the USA today, and will try to take a look at what
> you came up with on the way. I will definitely have some feedback in
> the next few days.

Thank you so much!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Peter Geoghegan
Дата:
On Sun, Oct 27, 2019 at 12:04 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> (0001-Fix-deadlock-between-ginDeletePage-and-ginStepRigh-3.patch)

Some thoughts on the first patch:

* How do comparisons of items in each type of B-Tree work?

I would like to see a statement about invariants similar to nbtree's
"subtree" invariant. Looks like the "high key" is <= (not <) in each
case (i.e. for both entry trees and posting trees), just like nbtree.
nbtree has an explicit high key rather than using the current highest
data item on the page (maybe this can be called a "pseudo high key" or
something). That is one important difference, but in general GIN seems
to copy nbtree. I think.

However, GIN never explains stuff that must be affected by invariants
in binary search routines like entryLocateLeafEntry() and
dataLocateItem(). GIN never makes the similarities and differences
clear. Maybe you could do more on that -- it's too hard to tell why
entryLocateLeafEntry() and entryLocateEntry() (i.e. leaf and internal
page binary search variants for entry B-Trees) do things differently
to each other (and do things differently to nbtree's binary search
routine).

This code from entryLocateEntry() is a good example of how this can be
confusing:

        if (result == 0)
        {
            stack->off = mid;
            Assert(GinGetDownlink(itup) != GIN_ROOT_BLKNO);
            return GinGetDownlink(itup);
        }
        else if (result > 0)
            low = mid + 1;
        else
            high = mid;

Some questions about this code:

* Why is the "if (result == 0)" branch using the block number/downlink
from the "mid" tuple as its return value?

Why not follow nbtree's _bt_binsrch() here too, by returning "the last
key < scan key" on internal pages? If your high key from one level
down is <=, and also a pseudo high key (i.e. both a "real" entry and a
high key), how can it be correct to go to the block/downlink from an
equal "mid" tuple like this? Won't that make index scans miss the
pseudo high key, which they have to return to the scan?

* Why is it okay that there is no "negative infinity" item in internal
pages for the entry tree?

GIN does not have to copy nbtree in every detail to be correct, but it
confuses me that it *mostly* copies nbtree, but doesn't do so *fully*.
As I wrote just now (or at least implied), entryIsMoveRight() works in
a similar way to _bt_moveright(), and yet we still have these apparent
differences with how binary searches work that seems to not be
compatible with that. Maybe this is okay, but I cannot understand why
that is right now. It looks wrong to me -- so wrong that I suppose I
must be mistaken.

I also spotted some fairly minor issues:

* s/rightest/rightmost/

*  It's weird that GinBtree/GinBtreeData is a set of callbacks for
both posting trees and main entry trees, since the rules are now quite
different in each case. Not sure if it's worth saying something about
that.

*  "Exclusive lock" makes me think of "ExclusiveLock", which is a kind
of heavyweight lock (not a buffer lock). I suggest changing that
wording to avoid confusion.

In general, it seems very important to be clear about exactly how the
key space works. The nbtree work for v12 greatly benefitted from
defining comparisons in a way that didn't really change how nbtree
worked, while at the same time minimizing I/O and making nbtree
faithful to Lehman & Yao's original design. It isn't obvious how
valuable it is to really carefully define how invariants and key
comparisons work, but it seems possible to solve a lot of problems
that way.

--
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
Hi!

I'm sorry for late reply.  I was busy with various things.  Also
digging into these details took some time.  Please find my explanation
below.

On Wed, Oct 30, 2019 at 2:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
> In general, it seems very important to be clear about exactly how the
> key space works. The nbtree work for v12 greatly benefitted from
> defining comparisons in a way that didn't really change how nbtree
> worked, while at the same time minimizing I/O and making nbtree
> faithful to Lehman & Yao's original design. It isn't obvious how
> valuable it is to really carefully define how invariants and key
> comparisons work, but it seems possible to solve a lot of problems
> that way.

gin packs downlinks and pivot key into tuples in the different way
than nbtree does.  Let's see the logical structure of internal B-tree
page.  We can represent it as following.

downlink_1, key_1_2, downlink_2, key_2_3, .... , downlink_n, highkey

key_{i-1}_i is pivot key, which splits key space between
downlink_{i-1} and downlink_i.  So, every key under downlink_{i-1} is
<= key_{i-1}_i.  Every key under downlink_i is > key_{i-1}_i.  And all
underlying keys are <= highkey.

nbtree packs that into tuples as followings (tuples are shown in parentheses).

(highkey), (-inf, downlink_1), (key_1_2, downlink_2), ...
(key_{n-1}_n,  downlink_n)

So, we store highkey separately.  key_{i-1}_i and downlink_i form a
tuple together.  downlink_1 is coupled with -inf key.

gin packs tuples in a different way.

(downlink_1, key_1_2), (downlink_2, key_2_3), ... , (downlink_n, highkey)

So, in GIN key_{i-1}_i and downlink_{i-1} form a tuple.  Highkey is
coupled with downlink_n.  And -inf key is not needed here.

But there is couple notes about highkey:
1) In entry tree rightmost page, a key coupled with downlink_n doesn't
really matter.  Highkey is assumed to be infinity.
2) In posting tree, a key coupled with downlink_n always doesn't
matter.  Highkey for non-rightmost pages is stored separately and
accessed via GinDataPageGetRightBound().

I think this explains following:
1) The invariants in gin btree are same as they are in nbtree.  Just
page layout is different.
2) The way entryLocateEntry() works.  In particular why it's OK to go
the mid downlink if corresponding key equals.
3) There is no "negative infinity" item in internal pages.

Does the explanation of above looks clear for you?  If so, I can put
it into the README.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Mon, Nov 11, 2019 at 2:42 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I'm sorry for late reply.  I was busy with various things.  Also
> digging into these details took some time.  Please find my explanation
> below.
>
> On Wed, Oct 30, 2019 at 2:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > In general, it seems very important to be clear about exactly how the
> > key space works. The nbtree work for v12 greatly benefitted from
> > defining comparisons in a way that didn't really change how nbtree
> > worked, while at the same time minimizing I/O and making nbtree
> > faithful to Lehman & Yao's original design. It isn't obvious how
> > valuable it is to really carefully define how invariants and key
> > comparisons work, but it seems possible to solve a lot of problems
> > that way.
>
> gin packs downlinks and pivot key into tuples in the different way
> than nbtree does.  Let's see the logical structure of internal B-tree
> page.  We can represent it as following.
>
> downlink_1, key_1_2, downlink_2, key_2_3, .... , downlink_n, highkey
>
> key_{i-1}_i is pivot key, which splits key space between
> downlink_{i-1} and downlink_i.  So, every key under downlink_{i-1} is
> <= key_{i-1}_i.  Every key under downlink_i is > key_{i-1}_i.  And all
> underlying keys are <= highkey.
>
> nbtree packs that into tuples as followings (tuples are shown in parentheses).
>
> (highkey), (-inf, downlink_1), (key_1_2, downlink_2), ...
> (key_{n-1}_n,  downlink_n)
>
> So, we store highkey separately.  key_{i-1}_i and downlink_i form a
> tuple together.  downlink_1 is coupled with -inf key.
>
> gin packs tuples in a different way.
>
> (downlink_1, key_1_2), (downlink_2, key_2_3), ... , (downlink_n, highkey)
>
> So, in GIN key_{i-1}_i and downlink_{i-1} form a tuple.  Highkey is
> coupled with downlink_n.  And -inf key is not needed here.
>
> But there is couple notes about highkey:
> 1) In entry tree rightmost page, a key coupled with downlink_n doesn't
> really matter.  Highkey is assumed to be infinity.
> 2) In posting tree, a key coupled with downlink_n always doesn't
> matter.  Highkey for non-rightmost pages is stored separately and
> accessed via GinDataPageGetRightBound().
>
> I think this explains following:
> 1) The invariants in gin btree are same as they are in nbtree.  Just
> page layout is different.
> 2) The way entryLocateEntry() works.  In particular why it's OK to go
> the mid downlink if corresponding key equals.
> 3) There is no "negative infinity" item in internal pages.
>
> Does the explanation of above looks clear for you?  If so, I can put
> it into the README.

So, I've put this explanation into README patch.  I just change
designation to better match with Lehman & Yao paper and did some minor
enchantments.

I'm going to push this patchset if no objections.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock(PG10.7)

От
Alexander Korotkov
Дата:
On Sun, Nov 17, 2019 at 9:18 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> So, I've put this explanation into README patch.  I just change
> designation to better match with Lehman & Yao paper and did some minor
> enchantments.
>
> I'm going to push this patchset if no objections.

So, pushed with minor changes during backpatching.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company