Discussion: autovacuum locks


autovacuum locks

From: Gregg Jaskiewicz
Folks,

I've got a system here (8.3.7) that is locked up. A few queries are waiting
for locks acquired by autovacuum on a table or two.
But it looks like autovacuum is also waiting on a semaphore:


#0  0x00f07410 in __kernel_vsyscall ()
#1  0x00252d2b in semop () from /lib/libc.so.6
#2  0x081ca1ce in PGSemaphoreLock ()
#3  0x081f590e in ProcWaitForSignal ()
#4  0x081e8dea in LockBufferForCleanup ()
#5  0x08160e68 in ?? ()
#6  0x08161482 in lazy_vacuum_rel ()
#7  0x081603b5 in ?? ()
#8  0x081605c4 in vacuum ()
#9  0x081cd154 in ?? ()
#10 0x081cd7f1 in ?? ()
#11 0x081cd911 in StartAutoVacWorker ()
#12 0x081d6d47 in ?? ()
#13 <signal handler called>
#14 0x00f07410 in __kernel_vsyscall ()
#15 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
#16 0x081d3964 in ?? ()
#17 0x081d5105 in PostmasterMain ()
#18 0x0818be60 in main ()

There are still other transactions going on, and working fine - but two
tables seem to be tied up by autovacuum, and other queries are stuck
waiting for autovacuum to finish.

Any ideas?
Is there a fix further up the 8.3 stream that would address this? (I
need leverage to upgrade to 8.3.18, and if that's the case - happy
days.)


-- 
GJ


Re: autovacuum locks

From: Alvaro Herrera
Excerpts from Gregg Jaskiewicz's message of vie mar 02 07:44:07 -0300 2012:
> Folks,
>
> I've got a system here (8.3.7) that is locked up. A few queries are waiting
> for locks acquired by autovacuum on a table or two.
> But it looks like autovacuum is also waiting on a semaphore:
>
>
> #0  0x00f07410 in __kernel_vsyscall ()
> #1  0x00252d2b in semop () from /lib/libc.so.6
> #2  0x081ca1ce in PGSemaphoreLock ()
> #3  0x081f590e in ProcWaitForSignal ()
> #4  0x081e8dea in LockBufferForCleanup ()
> #5  0x08160e68 in ?? ()
> #6  0x08161482 in lazy_vacuum_rel ()
> #7  0x081603b5 in ?? ()
> #8  0x081605c4 in vacuum ()
> #9  0x081cd154 in ?? ()
> #10 0x081cd7f1 in ?? ()
> #11 0x081cd911 in StartAutoVacWorker ()
> #12 0x081d6d47 in ?? ()
> #13 <signal handler called>
> #14 0x00f07410 in __kernel_vsyscall ()
> #15 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
> #16 0x081d3964 in ?? ()
> #17 0x081d5105 in PostmasterMain ()
> #18 0x0818be60 in main ()
>
> There are still other transactions going on, and working fine - but two
> tables seem to be tied up by autovacuum, and other queries are stuck
> waiting for autovacuum to finish.

This vacuum is waiting for some other process to release the buffer
lock.  These locks are usually very short-lived, but maybe you have some
other backend doing something weird that won't release it.  Do you have
idle processes with transactions open, and if so, what are they
doing?  Do you have prepared transactions open? (see the
pg_prepared_xacts view)
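
On 8.3 you could check both with something like this (just a sketch; 8.3's
pg_stat_activity uses procpid/current_query, and a backend that is idle
with a transaction still open shows '<IDLE> in transaction' there):

-- backends sitting idle with a transaction still open
SELECT procpid, usename, xact_start, current_query
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
ORDER BY xact_start;

-- prepared transactions not yet committed or rolled back
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;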

> Any ideas?
> Is there a fix further up the 8.3 stream that would address this?

I don't think so, though I am not certain.

> (I need leverage to upgrade to 8.3.18, and if that's the case - happy
> days.)

Sorry, can't help you there :-)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum locks

From: Gregg Jaskiewicz
On 2 March 2012 11:03, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>
> Excerpts from Gregg Jaskiewicz's message of vie mar 02 07:44:07 -0300 2012:
>> Folks,
>>
>> I've got a system here (8.3.7) that is locked up. A few queries are waiting
>> for locks acquired by autovacuum on a table or two.
>> But it looks like autovacuum is also waiting on a semaphore:
>>
>>
>> #0  0x00f07410 in __kernel_vsyscall ()
>> #1  0x00252d2b in semop () from /lib/libc.so.6
>> #2  0x081ca1ce in PGSemaphoreLock ()
>> #3  0x081f590e in ProcWaitForSignal ()
>> #4  0x081e8dea in LockBufferForCleanup ()
>> #5  0x08160e68 in ?? ()
>> #6  0x08161482 in lazy_vacuum_rel ()
>> #7  0x081603b5 in ?? ()
>> #8  0x081605c4 in vacuum ()
>> #9  0x081cd154 in ?? ()
>> #10 0x081cd7f1 in ?? ()
>> #11 0x081cd911 in StartAutoVacWorker ()
>> #12 0x081d6d47 in ?? ()
>> #13 <signal handler called>
>> #14 0x00f07410 in __kernel_vsyscall ()
>> #15 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
>> #16 0x081d3964 in ?? ()
>> #17 0x081d5105 in PostmasterMain ()
>> #18 0x0818be60 in main ()
>>
>> There are still other transactions going on, and working fine - but two
>> tables seem to be tied up by autovacuum, and other queries are stuck
>> waiting for autovacuum to finish.
>
> This vacuum is waiting for some other process to release the buffer
> lock.  These locks are usually very short-lived, but maybe you have some
> other backend doing something weird that won't release it.  Do you have
> idle processes with transactions open, and if so, what are they
> doing?  Do you have prepared transactions open? (see the
> pg_prepared_xacts view)

That view is empty, but I know that the queries that are waiting for
locks to be released are in the middle of a two-phase commit
transaction.

It could be that the system is doing something, but it is just so slow
that those locks are starved - and no longer so short-lived :)

Is there any particular info I could gather to make it easier for you
guys to look at?

Here's a backtrace from one of the autovacuums (this time with line # info):
#0  0x00f07410 in __kernel_vsyscall ()
#1  0x00252d2b in semop () from /lib/libc.so.6
#2  0x081ca1ce in PGSemaphoreLock (sema=0xb792bdf8, interruptOK=1
'\001') at pg_sema.c:420
#3  0x081f590e in ProcWaitForSignal () at proc.c:1269
#4  0x081e8dea in LockBufferForCleanup (buffer=24516) at bufmgr.c:2304
#5  0x08160e68 in lazy_vacuum_heap (onerel=0x87573c0,
vacrelstats=0x86a4768) at vacuumlazy.c:655
#6  0x08161482 in lazy_scan_heap (onerel=0x87573c0,
vacstmt=0xbfbcfb50, bstrategy=0x86fbe98) at vacuumlazy.c:591
#7  lazy_vacuum_rel (onerel=0x87573c0, vacstmt=0xbfbcfb50,
bstrategy=0x86fbe98) at vacuumlazy.c:191
#8  0x081603b5 in vacuum_rel (relid=<value optimized out>,
vacstmt=0xbfbcfb50, expected_relkind=114 'r', for_wraparound=0 '\000')
at vacuum.c:1126
#9  0x081605c4 in vacuum (vacstmt=0xbfbcfb50, relids=0x86fd310,
bstrategy=0x86fbe98, for_wraparound=0 '\000', isTopLevel=1 '\001') at
vacuum.c:427
#10 0x081cd154 in do_autovacuum () at autovacuum.c:2638
#11 0x081cd7f1 in AutoVacWorkerMain (argc=<value optimized out>,
argv=<value optimized out>) at autovacuum.c:1613
#12 0x081cd911 in StartAutoVacWorker () at autovacuum.c:1416
#13 0x081d6d47 in StartAutovacuumWorker (postgres_signal_arg=10) at
postmaster.c:4109
#14 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:3845
#15 <signal handler called>
#16 0x00f07410 in __kernel_vsyscall ()
#17 0x00249fcd in ___newselect_nocancel () from /lib/libc.so.6
#18 0x081d3964 in ServerLoop () at postmaster.c:1234
#19 0x081d5105 in PostmasterMain (argc=5, argv=0x8641828) at postmaster.c:1029
#20 0x0818be60 in main (argc=5, argv=0x0) at main.c:188


And here's the connection that is waiting for locks to be released (the
locked-up connection). That query has been hanging there since ~18:00 BST:

#0  0x00f07410 in __kernel_vsyscall ()
#1  0x00252d2b in semop () from /lib/libc.so.6
#2  0x081ca1ce in PGSemaphoreLock (sema=0xb7929618, interruptOK=1
'\001') at pg_sema.c:420
#3  0x081f5c53 in ProcSleep (locallock=0x86578f0,
lockMethodTable=0x834e0f4) at proc.c:880
#4  0x081f4a5d in WaitOnLock (locallock=0x86578f0, owner=0x88676d8) at
lock.c:1147
#5  0x081f50af in LockAcquire (locktag=0xbfbcf2b8, lockmode=7,
sessionLock=0 '\000', dontWait=0 '\000') at lock.c:792
#6  0x081f27f9 in LockTuple (relation=0xa72f53a8, tid=0xbfbcf334,
lockmode=7) at lmgr.c:378
#7  0x0809b19c in heap_update (relation=0xa72f53a8, otid=0xbfbcf3da,
newtup=0x86d3be0, ctid=0xbfbcf3d4, update_xmax=0xbfbcf3e0, cid=2,
crosscheck=0x0, wait=1 '\001') at heapam.c:2376
#8  0x08167f8b in ExecutePlan (queryDesc=0x870d010,
direction=ForwardScanDirection, count=0) at execMain.c:1880
#9  ExecutorRun (queryDesc=0x870d010, direction=ForwardScanDirection,
count=0) at execMain.c:270
#10 0x0818051e in _SPI_execute_plan (plan=0x87b8948, Values=0x8e960f8,
Nulls=0x8e960e8 "  ", snapshot=0x0, crosscheck_snapshot=0x0,
read_only=0 '\000', fire_triggers=1 '\001', tcount=0) at spi.c:1845
#11 0x08180796 in SPI_execute_plan (plan=0x87b8948, Values=0x8e960f8,
Nulls=0x8e960e8 "  ", read_only=0 '\000', tcount=0) at spi.c:374
#12 0x0012d488 in exec_stmt_execsql (estate=0xbfbcf7bc,
stmt=0x87a2630) at pl_exec.c:2546
#13 0x0012f0cc in exec_stmt (estate=0xbfbcf7bc, stmts=<value optimized
out>) at pl_exec.c:1260
#14 exec_stmts (estate=0xbfbcf7bc, stmts=<value optimized out>) at
pl_exec.c:1171
#15 0x00130d19 in exec_stmt_block (estate=0xbfbcf7bc, block=0x87a2778)
at pl_exec.c:1114
#16 0x001310e7 in plpgsql_exec_trigger (func=0x8844ae0,
trigdata=0xbfbcfb78) at pl_exec.c:636
#17 0x00127082 in plpgsql_call_handler (fcinfo=0xbfbcf934) at pl_handler.c:92
#18 0x0814a63a in ExecCallTriggerFunc (trigdata=0xbfbcfb78, tgindx=8,
finfo=0x12800b, instr=0x0, per_tuple_context=0x87f5fd8) at
trigger.c:1577
#19 0x0814b69b in ExecBRUpdateTriggers (estate=0x885bd28,
relinfo=0x885bdb0, tupleid=0xbfbcfc2a, newtuple=0x8899310) at
trigger.c:2000
#20 0x0816820e in ExecutePlan (queryDesc=0x8ab81e8,
direction=ForwardScanDirection, count=0) at execMain.c:1835
#21 ExecutorRun (queryDesc=0x8ab81e8, direction=ForwardScanDirection,
count=0) at execMain.c:270
#22 0x08203054 in ProcessQuery (plan=0x8ab7f18, params=<value
optimized out>, dest=0x8ab81a0, completionTag=0xbfbcfcc8 "") at
pquery.c:179
#23 0x08203251 in PortalRunMulti (portal=0x86eb0b0, isTopLevel=<value
optimized out>, dest=0x8ab81a0, altdest=0x83bf95c,
completionTag=0xbfbcfcc8 "") at pquery.c:1242
#24 0x0820345d in FillPortalStore (portal=0x86eb0b0, isTopLevel=1
'\001') at pquery.c:1045
#25 0x08203a31 in PortalRun (portal=0x86eb0b0, count=2147483647,
isTopLevel=1 '\001', dest=0x86a16d0, altdest=0x86a16d0,
completionTag=0xbfbcffd8 "") at pquery.c:792
#26 0x08200154 in exec_execute_message (argc=4, argv=<value optimized
out>, username=0x8644890 "foo") at postgres.c:1942
#27 PostgresMain (argc=4, argv=<value optimized out>,
username=0x8644890 "foo") at postgres.c:3688
#28 0x081d4228 in BackendStartup () at postmaster.c:3207
#29 ServerLoop () at postmaster.c:1274
#30 0x081d5105 in PostmasterMain (argc=5, argv=0x8641828) at postmaster.c:1029
#31 0x0818be60 in main (argc=5, argv=0x4489ffff) at main.c:188



--
GJ


Re: autovacuum locks

From: Gregg Jaskiewicz
Looking at the system a bit more now, it looks like the 'waiting' states
are changing for both the query and autovacuum in pg_stat_activity.
But very slowly. It looks like they both got into the sort of state
that keeps them on the edge of starvation.

So this isn't really a deadlock, but rather very poor performance in
this corner case.

This is a test system, so I can harvest some more data from it. I've
already got core files for the autovacuum and query connections.
I'm hoping for more suggestions on how to gather more intel that could
be useful to you guys.

Unfortunately my knowledge of the locking mechanisms in PostgreSQL is
rather poor, beyond the basics.


Re: autovacuum locks

From: Alvaro Herrera
Excerpts from Gregg Jaskiewicz's message of vie mar 02 08:22:22 -0300 2012:
>
> Looking at the system a bit more now, it looks like the 'waiting' states
> are changing for both the query and autovacuum in pg_stat_activity.
> But very slowly. It looks like they both got into the sort of state
> that keeps them on the edge of starvation.

Ouch.

> So this isn't really a deadlock, but rather very poor performance in
> this corner case.

Right.  I think I can explain how this locking works: autovacuum needs a
"cleanup" lock on the page being processed, which is a special exclusive
lock that also requires that no one is holding a "pin" on the buffer.
Any process running a query holds a pin on the buffer while inspecting
tuples on it; when it's done with the tuples on that page it should move on
to the next page in the table -- same as autovac.  So what seems to be
happening here is that the autovac and the scan are walking the
table in sync, stepping on each other's toes.

What I don't know is why they are so closely in sync.
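
You can watch that leapfrogging from the outside with something like this
(a rough sketch; 8.3's pg_stat_activity has a boolean "waiting" column,
and autovacuum workers report 'autovacuum: ...' in current_query):

-- poll this; the vacuum and the scan should take turns showing waiting
SELECT procpid, waiting, query_start, current_query
FROM pg_stat_activity
WHERE waiting
   OR current_query LIKE 'autovacuum:%';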

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum locks

From: Robert Haas
On Fri, Mar 2, 2012 at 6:22 AM, Gregg Jaskiewicz <gryzman@gmail.com> wrote:
> Looking at the system a bit more now, it looks like the 'waiting' states
> are changing for both the query and autovacuum in pg_stat_activity.
> But very slowly. It looks like they both got into the sort of state
> that keeps them on the edge of starvation.
>
> So this isn't really a deadlock, but rather very poor performance in
> this corner case.

Are you making any use of cursors?
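
Asking because a cursor left open partway through a scan keeps a pin on
the buffer under its current row, and a pinned buffer is exactly what
LockBufferForCleanup has to wait out.  A contrived sketch of the pattern
(table and cursor names made up):

BEGIN;
DECLARE c CURSOR FOR SELECT * FROM some_table;
FETCH 1 FROM c;  -- c now pins the buffer holding its current row
-- ... application wanders off; vacuum's cleanup lock on that page
-- cannot be granted until the cursor moves on or is closed
CLOSE c;
COMMIT;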

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: autovacuum locks

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Right.  I think I can explain how this locking works: autovacuum needs a
> "cleanup" lock on the page being processed, which is a special exclusive
> lock that also requires that no one is holding a "pin" on the buffer.
> Any process running a query holds a pin on the buffer while inspecting
> tuples on it; when it's done with the tuples on that page it should move on
> to the next page in the table -- same as autovac.  So what seems to be
> happening here is that the autovac and the scan are walking the
> table in sync, stepping on each other's toes.

I don't believe that.  The trace shows the other process is waiting for
a tuple lock, which is not something that autovacuum would take.

Given the reference to prepared transactions, what seems likely is that
the UPDATE command is blocked by a prepared transaction (IOW, one that
already updated the same tuple) and VACUUM is stuck behind the UPDATE.
So the real problem is slow removal of prepared transactions, which
most likely is an application logic problem.  It's certainly not
autovac's fault.
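
One quick way to check that from pg_locks: locks held by prepared
transactions show up there with a NULL pid (a sketch; their
virtualtransaction reads '-1/<xid>', which you can eyeball against
pg_prepared_xacts.transaction):

-- locks belonging to prepared (2PC) transactions
SELECT locktype, relation::regclass AS rel, mode, granted, virtualtransaction
FROM pg_locks
WHERE pid IS NULL;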
        regards, tom lane


Re: autovacuum locks

From: "Kevin Grittner"
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So the real problem is slow removal of prepared transactions,
> which most likely is an application logic problem.  It's certainly
> not autovac's fault.
Yeah, I've seen way too much Java code lately which fails to close
ResultSet or Statement (which includes PreparedStatement) objects,
leaving it to the vagaries of the Java garbage collector to close
these objects right before freeing their memory to the Java heap. 
People should never count on the finalize() method to do this for
them -- the close() method should be invoked as soon as the object
is no longer needed, and definitely before it goes out of scope.
-Kevin


Re: autovacuum locks

From: Greg Jaskiewicz
On 2 Mar 2012, at 15:28, Robert Haas wrote:

> On Fri, Mar 2, 2012 at 6:22 AM, Gregg Jaskiewicz <gryzman@gmail.com> wrote:
>> Looking at the system a bit more now, it looks like the 'waiting' states
>> are changing for both the query and autovacuum in pg_stat_activity.
>> But very slowly. It looks like they both got into the sort of state
>> that keeps them on the edge of starvation.
>> 
>> So this isn't really a deadlock, but rather very poor performance in
>> this corner case.
> 
> Are you making any use of cursors?

nope. 



Re: autovacuum locks

From: Gregg Jaskiewicz
Tom, agreed - it looks like we dug the hole and got ourselves into it.
But I still want to understand why.

It looks like we have a rather small table on the host where I see the
slowness, and all the other tables have triggers that update one row
in that small table. The small table contains a single entry per table.
The thing is, when I scan pg_locks I can pretty much see everything
waiting for a lock to access that table. To grab the pg_locks output,
I'm using this query:

SELECT
    waiting.locktype           AS waiting_locktype,
    waiting.relation::regclass AS waiting_table,
    waiting_stm.current_query  AS waiting_query,
    waiting.mode               AS waiting_mode,
    waiting.pid                AS waiting_pid,
    other.locktype             AS other_locktype,
    other.relation::regclass   AS other_table,
    other_stm.current_query    AS other_query,
    other.mode                 AS other_mode,
    other.pid                  AS other_pid,
    other.granted              AS other_granted
FROM pg_catalog.pg_locks AS waiting
JOIN pg_catalog.pg_stat_activity AS waiting_stm
    ON (waiting_stm.procpid = waiting.pid)
JOIN pg_catalog.pg_locks AS other
    ON (
        (waiting."database" = other."database" AND waiting.relation = other.relation)
        OR waiting.transactionid = other.transactionid
    )
JOIN pg_catalog.pg_stat_activity AS other_stm
    ON (other_stm.procpid = other.pid)
WHERE NOT waiting.granted
    AND waiting.pid <> other.pid
    AND other_stm.query_start < now() - interval '14 hours'
    AND other_stm.current_query NOT LIKE '<IDLE>';

And yes, some updates have been there for longer than 14 hours.

Now, there are two of those queries in particular - both updating just a
single row - that have been stuck for over 14 hours (2 days now, actually).
I simply cannot believe that a single table in the middle of things would
lock stuff up so much.

Also, on the subject of prepared transactions (2PC), the "select *
from pg_prepared_xacts;" query simply does not reveal anything,
despite the fact that I know there should be at least two of
those open.
Unless it only lists saved transactions, not transactions in the
middle of operation.

I need these 2PC transactions in order to achieve something close to
multi-master replication.
But what I think I'll target first is the triggers updating that
single table on my 'main master' - unless you guys can suggest
something better.


Re: autovacuum locks

From: "Kevin Grittner"
Gregg Jaskiewicz wrote:
> Also, on the subject of prepared transactions (2PC), the "select *
> from pg_prepared_xacts;" query simply does not reveal anything,
> despite the fact that I know there should be at least two of
> those open.
> Unless it only lists saved transactions, not transactions in the
> middle of operation.
Transactions only appear there between the PREPARE TRANSACTION statement
and the COMMIT PREPARED (or ROLLBACK PREPARED).  Hopefully that is a
small window of time for each transaction.
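
For example (a minimal sketch with made-up names):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
PREPARE TRANSACTION 'txn-1';
-- 'txn-1' is now listed in pg_prepared_xacts, and keeps holding its
-- locks, until one of these runs:
COMMIT PREPARED 'txn-1';    -- or: ROLLBACK PREPARED 'txn-1'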
-Kevin