Discussion: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
The following bug has been logged online:
Bug reference: 5321
Logged by: Jerry Gamache
Email address: jerry.gamache@idilia.com
PostgreSQL version: 8.4.2
Operating system: Linux
Description: Parallel restore temporarily deadlocked by autovacuum analyze
Details:
While restoring a database in 8.4.2, pg_restore stopped for more than 30
minutes doing nothing:
pg_restore -d database1 -j 8 -v database1.pg_dump
The parallel restore showed very impressive performance as the verbose
output went by, but the restore suddenly stopped. The output of "ps" after
15 minutes of 0% CPU activity showed 4 processes in "Ss" state:
postgres: xxx database1 [local] CREATE INDEX waiting
postgres: xxx database1 [local] CREATE INDEX waiting
postgres: autovacuum worker process database1
postgres: autovacuum worker process database1
The output of pg_stat_activity for database1 showed:
database1=# select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid;
 procpid | datname   | current_query
---------+-----------+---------------------------------------------------------------------------------------------------------------
    6146 | database1 | autovacuum: ANALYZE public.table_x
    6270 | database1 |
         :             CREATE UNIQUE INDEX index1 ON table_x USING btree (col_a, col_b);
    6271 | database1 |
         :             CREATE UNIQUE INDEX index2 ON table_x USING btree (col_b, col_c, col_d);
    6307 | database1 | autovacuum: ANALYZE public.table_y
    6597 | database1 | <IDLE>
    6626 | database1 | select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid;
(6 rows)
The output of pg_locks for the 4 stuck PIDs showed:
database1=# select * from pg_locks where pid IN (SELECT procpid FROM pg_stat_activity WHERE datname='database1') ORDER BY pid;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |           mode           | granted
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+--------------------------+---------
 relation   |    16993 |    17597 |      |       |            |               |         |       |          | 11/57              | 6146 | AccessShareLock          | t
 relation   |    16993 |    17616 |      |       |            |               |         |       |          | 11/57              | 6146 | AccessShareLock          | t
 relation   |    16993 |    17663 |      |       |            |               |         |       |          | 11/57              | 6146 | AccessShareLock          | t
 virtualxid |          |          |      |       | 11/57      |               |         |       |          | 11/57              | 6146 | ExclusiveLock            | t
 relation   |    16993 |    17165 |      |       |            |               |         |       |          | 11/57              | 6146 | ShareUpdateExclusiveLock | t
 virtualxid |          |          |      |       | 13/100     |               |         |       |          | 13/100             | 6270 | ExclusiveLock            | t
 relation   |    16993 |    17165 |      |       |            |               |         |       |          | 13/100             | 6270 | ShareLock                | f
 relation   |    16993 |    17165 |      |       |            |               |         |       |          | 14/50              | 6271 | ShareLock                | f
 virtualxid |          |          |      |       | 14/50      |               |         |       |          | 14/50              | 6271 | ExclusiveLock            | t
 relation   |    16993 |    17188 |      |       |            |               |         |       |          | 3/844              | 6307 | ShareUpdateExclusiveLock | t
 virtualxid |          |          |      |       | 3/844      |               |         |       |          | 3/844              | 6307 | ExclusiveLock            | t
 relation   |    16993 |    17660 |      |       |            |               |         |       |          | 3/844              | 6307 | AccessShareLock          | t
 relation   |    16993 |    17590 |      |       |            |               |         |       |          | 3/844              | 6307 | AccessShareLock          | t
 relation   |        0 |     2677 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
 relation   |        0 |     2676 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
 relation   |    16993 |    11042 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
 relation   |        0 |     2671 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
 relation   |        0 |     2672 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
 virtualxid |          |          |      |       | 7/187      |               |         |       |          | 7/187              | 6626 | ExclusiveLock            | t
 relation   |    16993 |    10969 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
 relation   |        0 |     1262 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
 relation   |        0 |     1260 |      |       |            |               |         |       |          | 7/187              | 6626 | AccessShareLock          | t
(22 rows)
The restore resumed while I was writing this report, and I saw these new
entries in the logs:
ERROR: canceling autovacuum task
CONTEXT: automatic analyze of table "database1.public.table_y"
ERROR: canceling autovacuum task
CONTEXT: automatic analyze of table "database1.public.table_y"
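As an aside for anyone reading a snapshot like the one above: the waiter/holder pairs can be extracted with a self-join on pg_locks. This is a generic sketch, not part of the original report; it only considers relation-level locks and assumes all sessions are in the same database:

  -- pair each ungranted relation lock with the granted lock(s) on the same relation
  SELECT w.pid                AS waiting_pid,
         h.pid                AS holding_pid,
         w.relation::regclass AS relation,
         w.mode               AS waited_mode,
         h.mode               AS held_mode
  FROM pg_locks w
  JOIN pg_locks h
    ON h.locktype = 'relation'
   AND h.relation = w.relation
   AND h.granted
   AND NOT w.granted
   AND h.pid <> w.pid
  WHERE w.locktype = 'relation';

On the snapshot above this would pair PIDs 6270 and 6271 (each waiting for a ShareLock on relation 17165 for CREATE INDEX) with PID 6146, the autovacuum worker holding ShareUpdateExclusiveLock on the same relation.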
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From: Alvaro Herrera
Date:
Jerry Gamache wrote:
> The restore resumed while I was writing this report, and I saw these new
> entries in the logs:
>
> ERROR: canceling autovacuum task
> CONTEXT: automatic analyze of table "database1.public.table_y"
> ERROR: canceling autovacuum task
> CONTEXT: automatic analyze of table "database1.public.table_y"

This is what I would have expected to happen almost immediately ... what
are your autovacuum settings? And what's your deadlock timeout?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
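For reference, the settings Alvaro is asking about can be read in one query from pg_settings (a generic example, not part of the original exchange):

  -- current effective values; "source" shows whether they come from
  -- postgresql.conf or are still the built-in defaults
  SELECT name, setting, unit, source
  FROM pg_settings
  WHERE name = 'deadlock_timeout'
     OR name LIKE 'autovacuum%'
  ORDER BY name;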
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From: Tom Lane
Date:

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Jerry Gamache wrote:
>> The restore resumed while I was writing this report, and I saw these new
>> entries in the logs:
>>
>> ERROR: canceling autovacuum task
>> CONTEXT: automatic analyze of table "database1.public.table_y"
>> ERROR: canceling autovacuum task
>> CONTEXT: automatic analyze of table "database1.public.table_y"
> This is what I would have expected to happen almost immediately ... what
> are your autovacuum settings? And what's your deadlock timeout?
The other details showed the blockage as being on table_x though.
Was that a typo in obscuring the details, or did the cancel really
happen on an unrelated table?
regards, tom lane
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From: Alvaro Herrera
Date:
Jerry Gamache wrote:
> I was also surprised that table_y seemed to be involved. This is not
> a typo. Might be caused by a FK constraint between table_y and
> table_x. From the logs, the autovacuum on table_x was canceled
> before the one on table_y, but the restore only resumed after the
> autovacuum on table_y was canceled. It is possible (but I cannot
> confirm) that the autovacuum thread on table_x was blocked for a
> while after the cancellation message was written to the log. I added
> timestamps to log_line_prefix to be able to give more details if
> this happens again.

Could you try to restore the whole dump again and see if you can
reproduce it? Maybe decreasing autovacuum_naptime makes it more
probable.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
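For anyone attempting the same reproduction: autovacuum_naptime cannot be changed per session, so the change has to go through postgresql.conf. A sketch of the procedure, assuming superuser access on an 8.4-style installation:

  -- check the launcher's current sleep interval between autovacuum rounds
  SHOW autovacuum_naptime;
  -- after lowering autovacuum_naptime (e.g. to 1s, as in the test below)
  -- in postgresql.conf, the new value can be picked up without a restart:
  SELECT pg_reload_conf();
  SHOW autovacuum_naptime;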
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From: Alvaro Herrera
Date:
Jerry Gamache wrote:
> I was not able to repro with default parameters, or at 15s naptime,
> and at 1s naptime I got only 1 deadlock in 3 tests.
>
> This time the deadlock was with table_a, table_b and table_c
> (table_x and table_y were not involved).
>
> 18395 | database1 | autovacuum: ANALYZE public.table_a
> 18406 | database1 | autovacuum: ANALYZE public.table_b
> 18510 | database1 |
>       :             CREATE UNIQUE INDEX index_bg ON table_b
>                     USING btree (col_g);
> 18567 | database1 | autovacuum: ANALYZE public.table_c
> 18802 | database1 | select procpid,datname,current_query from
> pg_stat_activity where datname='database1' ORDER BY procpid;
>
> There is a FK constraint between table_a and table_b, but table_c
> does not have any direct constraint relation with the other 2
> tables.
>
> The logs show that the autovacuum of table_b was canceled 20 minutes
> ago, but the thread is still alive and blocked.

That's pretty strange. Can we see a pg_locks snapshot? (Please attach
as a text file so that it doesn't get word-wrapped)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From: Alvaro Herrera
Date:
Jerry Gamache wrote:
> Here is the pg_locks output.
>
> Alvaro Herrera wrote:
>> Jerry Gamache wrote:
>>> The logs show that the autovacuum of table_b was canceled 20 minutes
>>> ago, but the thread is still alive and blocked.

Well, it's clearly locked on table_b, and that autovac is still running
... maybe it was a previous run that was cancelled? Do you have the PID
in log_line_prefix?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
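For completeness: whether the PID is already part of the log prefix can be checked from SQL; %p is the log_line_prefix escape that adds it and %t adds a timestamp (a generic illustration; the exact prefix used in this report is not shown in the thread):

  -- an empty or PID-less value here means individual log lines cannot be
  -- tied back to a particular backend or autovacuum worker
  SHOW log_line_prefix;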
Re: BUG #5321: Parallel restore temporarily deadlocked by autovacuum analyze
From: Alvaro Herrera
Date:
Jerry Gamache wrote:
> Yes, I have PID in the logs now. Problem was observed around 13h30,
> and there was no log output between 13h18 and 13h30.
> There are messages for table_b (pid 18350) and table_c (pid 18406),
> but none for table_a.

Eh, but according to the pg_locks snap you posted, the PID holding the
table_b lock is (was?) 20490.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
All autovacuum and deadlock_timeout settings are at their default values
(the lines are commented out in postgresql.conf).

Alvaro Herrera wrote:
> Jerry Gamache wrote:
>
>> The restore resumed while I was writing this report, and I saw these new
>> entries in the logs:
>>
>> ERROR: canceling autovacuum task
>> CONTEXT: automatic analyze of table "database1.public.table_y"
>> ERROR: canceling autovacuum task
>> CONTEXT: automatic analyze of table "database1.public.table_y"
>
> This is what I would have expected to happen almost immediately ... what
> are your autovacuum settings? And what's your deadlock timeout?
I was also surprised that table_y seemed to be involved. This is not a
typo. Might be caused by a FK constraint between table_y and table_x.
From the logs, the autovacuum on table_x was canceled before the one on
table_y, but the restore only resumed after the autovacuum on table_y
was canceled. It is possible (but I cannot confirm) that the autovacuum
thread on table_x was blocked for a while after the cancellation message
was written to the log. I added timestamps to log_line_prefix to be able
to give more details if this happens again.

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Jerry Gamache wrote:
>>> The restore resumed while I was writing this report, and I saw these new
>>> entries in the logs:
>>>
>>> ERROR: canceling autovacuum task
>>> CONTEXT: automatic analyze of table "database1.public.table_y"
>>> ERROR: canceling autovacuum task
>>> CONTEXT: automatic analyze of table "database1.public.table_y"
>
>> This is what I would have expected to happen almost immediately ... what
>> are your autovacuum settings? And what's your deadlock timeout?
>
> The other details showed the blockage as being on table_x though.
> Was that a typo in obscuring the details, or did the cancel really
> happen on an unrelated table?
>
> regards, tom lane
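The foreign-key relationships Jerry suspects can be listed from the catalogs. A generic sketch (the table names used in this thread are anonymized placeholders for the real schema):

  -- all FK constraints that originate from or point at table_x
  SELECT conname,
         conrelid::regclass  AS referencing_table,
         confrelid::regclass AS referenced_table
  FROM pg_constraint
  WHERE contype = 'f'
    AND (conrelid = 'table_x'::regclass OR confrelid = 'table_x'::regclass);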
I was not able to repro with default parameters, or at 15s naptime, and
at 1s naptime I got only 1 deadlock in 3 tests.
This time the deadlock was with table_a, table_b and table_c (table_x
and table_y were not involved).
18395 | database1 | autovacuum: ANALYZE public.table_a
18406 | database1 | autovacuum: ANALYZE public.table_b
18510 | database1 |
: CREATE UNIQUE INDEX index_bg ON table_b
USING btree (col_g);
18567 | database1 | autovacuum: ANALYZE public.table_c
18802 | database1 | select procpid,datname,current_query from
pg_stat_activity where datname='database1' ORDER BY procpid;
There is a FK constraint between table_a and table_b, but table_c does
not have any direct constraint relation with the other 2 tables.
The logs show that the autovacuum of table_b was canceled 20 minutes
ago, but the thread is still alive and blocked.
Alvaro Herrera wrote:
> Jerry Gamache wrote:
>
>> I was also surprised that table_y seemed to be involved. This is not
>> a typo. Might be caused by a FK constraint between table_y and
>> table_x. From the logs, the autovacuum on table_x was canceled
>> before the one on table_y, but the restore only resumed after the
>> autovacuum on table_y was canceled. It is possible (but I cannot
>> confirm) that the autovacuum thread on table_x was blocked for a
>> while after the cancellation message was written to the log. I added
>> timestamps to log_line_prefix to be able to give more details if
>> this happens again.
>>
>
> Could you try to restore the whole dump again and see if you can
> reproduce it? Maybe decreasing autovacuum_naptime makes it more
> probable.
>
>
Here is the pg_locks output.
Alvaro Herrera wrote:
> Jerry Gamache wrote:
>
>> I was not able to repro with default parameters, or at 15s naptime,
>> and at 1s naptime I got only 1 deadlock in 3 tests.
>>
>> This time the deadlock was with table_a, table_b and table_c
>> (table_x and table_y were not involved).
>>
>> 18395 | database1 | autovacuum: ANALYZE public.table_a
>> 18406 | database1 | autovacuum: ANALYZE public.table_b
>> 18510 | database1 |
>> : CREATE UNIQUE INDEX index_bg ON table_b
>> USING btree (col_g);
>> 18567 | database1 | autovacuum: ANALYZE public.table_c
>> 18802 | database1 | select procpid,datname,current_query from
>> pg_stat_activity where datname='database1' ORDER BY procpid;
>>
>> There is a FK constraint between table_a and table_b, but table_c
>> does not have any direct constraint relation with the other 2
>> tables.
>>
>> The logs show that the autovacuum of table_b was canceled 20 minutes
>> ago, but the thread is still alive and blocked.
>>
>
> That's pretty strange. Can we see a pg_locks snapshot? (Please attach
> as a text file so that it doesn't get word-wrapped)
>
>
database1=# select locktype,database,relation,virtualxid,virtualtransaction,pid,mode,granted from pg_locks where pid IN
(SELECT procpid FROM pg_stat_activity WHERE datname='database1') ORDER BY pid;
locktype | database | relation | virtualxid | virtualtransaction | pid | mode | granted
------------+----------+----------+------------+--------------------+-------+--------------------------+---------
relation | 20162 | 20893 | | 4/72 | 18395 | AccessShareLock | t
virtualxid | | | 4/72 | 4/72 | 18395 | ExclusiveLock | t
relation | 20162 | 20829 | | 4/72 | 18395 | AccessShareLock | t
relation | 20162 | 20894 | | 4/72 | 18395 | AccessShareLock | t
relation | 20162 | 20892 | | 4/72 | 18395 | AccessShareLock | t
relation | 20162 | 20515 | | 4/72 | 18395 | ShareUpdateExclusiveLock | t
relation | 20162 | 20891 | | 3/53 | 18406 | AccessShareLock | t
relation | 20162 | 20813 | | 3/53 | 18406 | AccessShareLock | t
relation | 20162 | 20490 | | 3/53 | 18406 | ShareUpdateExclusiveLock | t
virtualxid | | | 3/53 | 3/53 | 18406 | ExclusiveLock | t
relation | 20162 | 20490 | | 8/162 | 18510 | ShareLock | f
virtualxid | | | 8/162 | 8/162 | 18510 | ExclusiveLock | t
relation | 20162 | 20247 | | 7/238 | 18567 | ShareUpdateExclusiveLock | t
relation | 20162 | 20872 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20881 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20880 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20878 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20873 | | 7/238 | 18567 | AccessShareLock | t
virtualxid | | | 7/238 | 7/238 | 18567 | ExclusiveLock | t
relation | 20162 | 20876 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20882 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20879 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20797 | | 7/238 | 18567 | AccessShareLock | t
relation | 20162 | 20877 | | 7/238 | 18567 | AccessShareLock | t
relation | 0 | 2676 | | 1/611 | 18802 | AccessShareLock | t
relation | 0 | 2672 | | 1/611 | 18802 | AccessShareLock | t
virtualxid | | | 1/611 | 1/611 | 18802 | ExclusiveLock | t
relation | 0 | 1262 | | 1/611 | 18802 | AccessShareLock | t
relation | 20162 | 11042 | | 1/611 | 18802 | AccessShareLock | t
relation | 0 | 2671 | | 1/611 | 18802 | AccessShareLock | t
relation | 0 | 2677 | | 1/611 | 18802 | AccessShareLock | t
relation | 20162 | 10969 | | 1/611 | 18802 | AccessShareLock | t
relation | 0 | 1260 | | 1/611 | 18802 | AccessShareLock | t
(33 rows)
database1=# select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY procpid;
procpid | datname | current_query
---------+------------------+---------------------------------------------------------------------------------------------------------------
18395 | database1 | autovacuum: ANALYZE public.table_a
18406 | database1 | autovacuum: ANALYZE public.table_b
18510 | database1 |
: CREATE UNIQUE INDEX index_bg ON table_b USING btree (col_g);
18567 | database1 | autovacuum: ANALYZE public.table_c
18802 | database1 | select procpid,datname,current_query from pg_stat_activity where datname='database1' ORDER BY
procpid;
(5 rows)
Yes, I have PID in the logs now. Problem was observed around 13h30, and
there was no log output between 13h18 and 13h30.
There are messages for table_b (pid 18350) and table_c (pid 18406), but
none for table_a.

Alvaro Herrera wrote:
>>>> The logs show that the autovacuum of table_b was canceled 20 minutes
>>>> ago, but the thread is still alive and blocked.
>
> Well, it's clearly locked on table_b, and that autovac is still running
> ... maybe it was a previous run that was cancelled? Do you have the PID
> in log_line_prefix?

2010-02-10 13:10:26 EST::@:[18301]:ERROR: canceling autovacuum task
2010-02-10 13:10:26 EST::@:[18301]:CONTEXT: automatic analyze of table "database1.public.table_d"
2010-02-10 13:10:42 EST::@:[18350]:ERROR: canceling autovacuum task
2010-02-10 13:10:42 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_e"
2010-02-10 13:10:58 EST::@:[18320]:ERROR: canceling autovacuum task
2010-02-10 13:10:58 EST::@:[18320]:CONTEXT: automatic analyze of table "database1.public.table_f"
2010-02-10 13:11:00 EST::@:[18350]:ERROR: canceling autovacuum task
2010-02-10 13:11:00 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_g"
2010-02-10 13:11:08 EST::@:[18320]:ERROR: canceling autovacuum task
2010-02-10 13:11:08 EST::@:[18320]:CONTEXT: automatic analyze of table "database1.public.table_h"
2010-02-10 13:11:17 EST::@:[18350]:ERROR: canceling autovacuum task
2010-02-10 13:11:17 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_i"
2010-02-10 13:11:22 EST::@:[18320]:ERROR: canceling autovacuum task
2010-02-10 13:11:22 EST::@:[18320]:CONTEXT: automatic analyze of table "database1.public.table_j"
2010-02-10 13:11:42 EST::@:[18350]:ERROR: canceling autovacuum task
2010-02-10 13:11:42 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_b"
2010-02-10 13:13:39 EST::@:[18406]:ERROR: canceling autovacuum task
2010-02-10 13:13:39 EST::@:[18406]:CONTEXT: automatic analyze of table "database1.public.table_e"
2010-02-10 13:13:39 EST::@:[18395]:ERROR: canceling autovacuum task
2010-02-10 13:13:39 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_g"
2010-02-10 13:14:07 EST::@:[18406]:ERROR: canceling autovacuum task
2010-02-10 13:14:07 EST::@:[18406]:CONTEXT: automatic analyze of table "database1.public.table_c"
2010-02-10 13:14:49 EST::@:[18395]:ERROR: canceling autovacuum task
2010-02-10 13:14:49 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_h"
2010-02-10 13:14:49 EST::@:[18406]:ERROR: canceling autovacuum task
2010-02-10 13:14:49 EST::@:[18406]:CONTEXT: automatic analyze of table "database1.public.table_k"
2010-02-10 13:14:49 EST::@:[18350]:ERROR: canceling autovacuum task
2010-02-10 13:14:49 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_l"
2010-02-10 13:16:25 EST::@:[18350]:ERROR: canceling autovacuum task
2010-02-10 13:16:25 EST::@:[18350]:CONTEXT: automatic analyze of table "database1.public.table_m"
2010-02-10 13:17:13 EST::@:[18395]:ERROR: canceling autovacuum task
2010-02-10 13:17:13 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_i"
2010-02-10 13:18:20 EST::@:[18567]:ERROR: canceling autovacuum task
2010-02-10 13:18:20 EST::@:[18567]:CONTEXT: automatic analyze of table "database1.public.table_e"
2010-02-10 13:18:31 EST::@:[18395]:ERROR: canceling autovacuum task
2010-02-10 13:18:31 EST::@:[18395]:CONTEXT: automatic analyze of table "database1.public.table_l"
Alvaro Herrera wrote:
> Jerry Gamache wrote:
>
>> Yes, I have PID in the logs now. Problem was observed around 13h30,
>> and there was no log output between 13h18 and 13h30.
>> There are messages for table_b (pid 18350) and table_c (pid 18406),
>> but none for table_a.
>
> Eh, but according to the pg_locks snap you posted, the PID holding the
> table_b lock is (was?) 20490.

20490 is the relid of table_b. The PID holding the lock on table_b is 18406.
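A note on reading that last exchange: 20490 is a value from the pg_locks "relation" column, i.e. a table OID rather than a process ID. It can be resolved to a name with a regclass cast; a generic illustration using the OID quoted above (on the reporter's system this would come back as table_b):

  -- resolve a pg_locks relation OID to a table name (run in the same database)
  SELECT 20490::regclass AS relname;

  -- or annotate the whole lock view at once:
  SELECT pid, relation::regclass AS relname, mode, granted
  FROM pg_locks
  WHERE relation IS NOT NULL
  ORDER BY pid;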