Обсуждение: GIN fast insert database hang

Поиск
Список
Период
Сортировка

GIN fast insert database hang

От
Robert Haas
Дата:
While fooling around with the GIN fast insert patch tonight, I managed
to hang my test database.  :-(

I'm going to try to reproduce this, but here's approximately what I did.

create table foo (id serial, x int[], primary key (id));
create index foo_gin on foo using gin (x);
insert into foo (x) select array[(random() * 100)::int, (random() *
90)::int, (random()*80)::int] from generate_series(1,1000000);

The last insert command was run multiple times with various numbers in
place of the constant 1000000 through several autovacuum cycles.
Eventually, though, it froze up.

Here's pg_stat_activity, with apologies for the crappy formatting:
datid | datname | procpid | usesysid | usename |                           current_query                    | waiting |
        xact_start           |          query_
 
start          |         backend_start         | client_addr | client_port
-------+---------+---------+----------+---------+-------------------------------
--------------------------------------------------------------------------------
---------------------+---------+-------------------------------+----------------
---------------+-------------------------------+-------------+-------------16384 | rhaas   |   15071 |       10 | rhaas
 | insert into foo (x) select arr
 
ay[(random() * 100)::int, (random() * 90)::int, (random()*80)::int] from generat
e_series(1,1000000); | f       | 2009-02-11 21:42:24.984759-05 | 2009-02-11 21:4
2:24.984759-05 | 2009-02-11 21:31:45.061903-05 |             |          -116384 | rhaas   |   15530 |       10 | rhaas
| autovacuum: VACUUM public.foo
 
                    | f       | 2009-02-11 21:42:10.046085-05 | 2009-02-11 21:4
2:10.046085-05 | 2009-02-11 21:42:09.947378-05 |             |16384 | rhaas   |   15780 |       10 | rhaas   | explain
analyzeselect sum(1)
 
from foo where array[1] <@ x;                    | f       | 2009-02-11 21:43:20.518837-05 | 2009-02-11 21:4
3:20.518837-05 | 2009-02-11 21:43:18.717031-05 |             |          -116384 | rhaas   |   15994 |       10 | rhaas
| <IDLE>
 
                    | f       |                               | 2009-02-11 21:4

Neither the INSERT nor the EXPLAIN ANALYZE nor the autovacuum died
when sent SIGINT or SIGTERM.  They all seem to be hung on semop:

$ strace -p 15071
Process 15071 attached - interrupt to quit
semop(3866630, 0x7fff14a46be0, 1^C <unfinished ...>
Process 15071 detached
$ strace -p 15530
Process 15530 attached - interrupt to quit
semop(3866630, 0x7fff14a45740, 1^C <unfinished ...>
Process 15530 detached
$ strace -p 15780;
Process 15780 attached - interrupt to quit
semop(3866630, 0x7fff14a48c50, 1^C <unfinished ...>
Process 15780 detached
$ uname -a
Linux njpen7.snipinc.net 2.6.27.12-78.2.8.fc9.x86_64 #1 SMP Mon Jan 19
19:25:03 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/fedora-release
Fedora release 9 (Sulphur)

Backtrace of pid 15071:

#0  0x0000003b1bce6257 in semop () from /lib64/libc.so.6
#1  0x00000000005a8117 in PGSemaphoreLock ()
#2  0x00000000005c6409 in LockBufferForCleanup ()
#3  0x00000000004944d1 in ginInsertCleanup ()
#4  0x000000000049501d in ginHeapTupleFastInsert ()
#5  0x000000000048a772 in gininsert ()
#6  0x000000000068fb39 in FunctionCall6 ()
#7  0x0000000000469181 in index_insert ()
#8  0x0000000000540ee2 in ExecInsertIndexTuples ()
#9  0x0000000000536504 in standard_ExecutorRun ()
#10 0x00000000005e07e1 in ProcessQuery ()
#11 0x00000000005e0a06 in PortalRunMulti ()
#12 0x00000000005e11b2 in PortalRun ()
#13 0x00000000005dc507 in exec_simple_query ()
#14 0x00000000005ddad7 in PostgresMain ()
#15 0x00000000005b2e58 in ServerLoop ()
#16 0x00000000005b3b39 in PostmasterMain ()
#17 0x000000000055e1e8 in main ()

Backtrace of pid 15530:
#0  0x0000003b1bce6257 in semop () from /lib64/libc.so.6
#1  0x00000000005a8117 in PGSemaphoreLock ()
#2  0x00000000005d624d in LWLockAcquire ()
#3  0x0000000000494322 in ginInsertCleanup ()
#4  0x00000000004922ed in ginvacuumcleanup ()
#5  0x000000000068fddc in FunctionCall2 ()
#6  0x0000000000530e1e in lazy_scan_heap ()
#7  0x00000000005312d0 in lazy_vacuum_rel ()
#8  0x000000000052f935 in vacuum_rel ()
#9  0x000000000052fd4c in vacuum ()
#10 0x00000000005aa338 in do_autovacuum ()
#11 0x00000000005aa62d in AutoVacWorkerMain ()
#12 0x00000000005aa756 in StartAutoVacWorker ()
#13 0x00000000005b525e in sigusr1_handler ()
#14 <signal handler called>
#15 0x0000003b1bcdcbb3 in __select_nocancel () from /lib64/libc.so.6
#16 0x00000000005b281c in ServerLoop ()
#17 0x00000000005b3b39 in PostmasterMain ()
#18 0x000000000055e1e8 in main ()

Backtrace of pid 15780:
#0  0x0000003b1bce6257 in semop () from /lib64/libc.so.6
#1  0x00000000005a8117 in PGSemaphoreLock ()
#2  0x00000000005d624d in LWLockAcquire ()
#3  0x00000000004913ac in scanFastInsert ()
#4  0x000000000049201c in gingetbitmap ()
#5  0x000000000068fddc in FunctionCall2 ()
#6  0x000000000046878e in index_getbitmap ()
#7  0x00000000005459c5 in MultiExecBitmapIndexScan ()
#8  0x00000000005455f4 in BitmapHeapNext ()
#9  0x000000000053f492 in ExecScan ()
#10 0x0000000000538360 in ExecProcNode ()
#11 0x000000000054435a in ExecAgg ()
#12 0x00000000005382a0 in ExecProcNode ()
#13 0x0000000000535ed2 in standard_ExecutorRun ()
#14 0x00000000004f9b2f in ExplainOnePlan ()
#15 0x00000000004f9d4e in ExplainQuery ()
#16 0x00000000005df8f7 in PortalRunUtility ()
#17 0x00000000005e0c3c in FillPortalStore ()
#18 0x00000000005e123d in PortalRun ()
#19 0x00000000005dc507 in exec_simple_query ()
#20 0x00000000005ddad7 in PostgresMain ()
#21 0x00000000005b2e58 in ServerLoop ()
#22 0x00000000005b3b39 in PostmasterMain ()
#23 0x000000000055e1e8 in main ()

Since fast shutdown failed, I tried an immediate shutdown, which did
work, but the database won't start back up:

LOG:  database system was interrupted; last known up at 2009-02-11 21:42:38 EST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/3600E050
FATAL:  bad buffer id: 0
CONTEXT:  xlog redo Delete list page (16), node: 19425/19426/19427 blkno: 19429
LOG:  startup process (PID 17314) exited with exit code 1
LOG:  aborting startup due to startup process failure

Ugh...

...Robert


Re: GIN fast insert database hang

От
Robert Haas
Дата:
On Wed, Feb 11, 2009 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm going to try to reproduce this, but here's approximately what I did.

OK, I've managed to build a reproducible test case.  Initial setup is
just as I had before:

> create table foo (id serial, x int[], primary key (id));
> create index foo_gin on foo using gin (x);

Then just start these two commands running in different windows and wait:

while true; do psql -c "explain analyze select sum(1) from foo where
array[1] <@ x"; done
while true; do psql -c "insert into foo (x) select array[(random() *
100)::int, (random() * 90)::int, (random()*80)::int] from
generate_series(1,100000);"; done

I did this four times, sometimes with the variant of adding "set
enable_bitmapscan to false;" before the explain analyze.  In three
cases the database recovered succesfully after the immediate shutdown;
in the other case, it crapped out much as described in my original
email.

...Robert


Re: GIN fast insert database hang

От
Robert Haas
Дата:
> I did this four times, sometimes with the variant of adding "set
> enable_bitmapscan to false;" before the explain analyze.  In three
> cases the database recovered succesfully after the immediate shutdown;
> in the other case, it crapped out much as described in my original
> email.

Sorry to keep replying to myself, but I've figured that autovacuum is
not required to trigger this bug.  In fact, I can reliably trigger it
much more quickly just by starting two concurrent copies of:

psql -c "insert into foo (x) select array[(random() * 100)::int,
(random() * 90)::int, (random()*80)::int] from
generate_series(1,100000);"

This freezes the whole system even with autovacuum = off in
postgresql.conf.  As before, the backends wait on a semop() call.

I was able to reproduce the recovery failure this way once as well,
but that part of the problem seems to be much more erratic.  Most of
the time after an immediate shutdown, pg_ctl start triggers recovery
followed by normal startup.

...Robert


Re: GIN fast insert database hang

От
Teodor Sigaev
Дата:
> This freezes the whole system even with autovacuum = off in
> postgresql.conf.  As before, the backends wait on a semop() call.
Fixed. There was a deadlock of LockBufferForCleanup and LockBuffer(SHARE).
Redesign that place to downgrade  LockBufferForCleanup to LockBuffer(EXCLUSIVE)
with correction of page's locking during scan of pending list.


> I was able to reproduce the recovery failure this way once as well,
> but that part of the problem seems to be much more erratic.  Most of
Fixed too. I missed comments on XLogInsert:
  * NB: this routine feels free to scribble on the XLogRecData structs,
  * though not on the data they reference.  This is OK since the XLogRecData
  * structs are always just temporaries in the calling code.

and I reused once initialized XLogRecData many times in a loop.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Вложения