Обсуждение: BUG #4730: Vacuum full verbose analyze "deadlock"

Поиск
Список
Период
Сортировка

BUG #4730: Vacuum full verbose analyze "deadlock"

От
"Wayne Conrad"
Дата:
The following bug has been logged online:

Bug reference:      4730
Logged by:          Wayne Conrad
Email address:      wconrad@yagni.com
PostgreSQL version: 8.3.6
Operating system:   Debian 4.3.1-1
Description:        Vacuum full verbose analyze "deadlock"
Details:

"VACUUM FULL ANALYZE VERBOSE" on a "deadlocks" with other SQL statements.
No deadlock detection messages appear in the Postgres log, which is why I
have put "deadlock" in quotes: It's a deadlock in the computer science
meaning of the word, but perhaps not in the Postgres internal meaning of the
word.  Specifically, the VACUUM appears to be waiting on SQL statements
which are in turn waiting on the VACUUM.  That condition continues
indefinitely until the VACUUM is manually terminated (a control-C in psql),
at which time all locks appear to be released and processing continues
normally.

A query of pg_locks shows the condition.  We've written a program that
queries pg_locks and shows who is waiting on whom.   here is a snippet of
its output for one of these incidents.  If a raw query of that table (or
some other) would be better, please let me know what that query should be.

----------

Lock:

locktype=relation relation=jobs
granted=false transaction=4/900 pid=10247 mode=AccessExclusiveLock
query = vacuum full verbose analyze;

Dependants of that lock:

locktype=relation relation=jobs
granted=false transaction=22/654227 pid=3321 mode=AccessShareLock
query = select id, to_char(time_stamp, 'YYYY-MM-DD HH24:MI:SS') as
time_stamp, injectd_log.log_pid as log_pid, injectd_host, injectd_pid, line
from injectd_log where log_pid in (select jobs.pid from jobs inner join
uploads using (upload_id) where account_id = E'1000-000' union (select pid
from uploads where account_id = E'1000-000')) and id > 27398675 order by id

locktype=relation relation=jobs
granted=false transaction=15/1241550 pid=32466 mode=AccessShareLock
query = select w.id as id, w.created_at as created_at, w.status as status,
w.timer as timer, w.error as error, w.priority as priority, w.work_type_id
as work_type_id, w.upload_id as upload_id, w.job_id as job_id,
w.batch_command_id as batch_command_id, w.mail_id as mail_id,
w.print_queue_id as print_queue_id from work_queue_available inner join
work_queue_data w on (work_queue_available.id = w.id) left outer join
batch_commands on (batch_command_id = batch_commands.id) left outer join
mail on (mail_id = mail.id) left outer join print_queue on (print_queue_id =
print_queue.id) left outer join uploads on (w.upload_id = uploads.upload_id)
left outer join jobs on (w.job_id = jobs.job_id) order by w.priority desc,
case when w.work_type_id = (select id from work_types where name = E'cron')
then 1 when w.work_type_id = (select id from work_types where name =
E'email') then 2 else 3 end, coalesce(batch_commands.created_at,
mail.created_at, print_queue.created_at, uploads.log_time, jobs.start_time),
id limit 28

locktype=relation relation=jobs
granted=true transaction=7/1931437 pid=27584 mode=AccessShareLock
query = <IDLE> in transaction

locktype=relation relation=jobs
granted=true transaction=8/2113080 pid=31961 mode=AccessShareLock
query = <IDLE> in transaction

----------

Lock:

locktype=relation relation=jobs
granted=false transaction=22/654227 pid=3321 mode=AccessShareLock
query = select id, to_char(time_stamp, 'YYYY-MM-DD HH24:MI:SS') as
time_stamp, injectd_log.log_pid as log_pid, injectd_host, injectd_pid, line
from injectd_log where log_pid in (select jobs.pid from jobs inner join
uploads using (upload_id) where account_id = E'1000-000' union (select pid
from uploads where account_id = E'1000-000')) and id > 27398675 order by id

Dependants of that lock:

locktype=relation relation=jobs
granted=false transaction=4/900 pid=10247 mode=AccessExclusiveLock
query = vacuum full verbose analyze;

locktype=relation relation=jobs
granted=false transaction=15/1241550 pid=32466 mode=AccessShareLock
query = select w.id as id, w.created_at as created_at, w.status as status,
w.timer as timer, w.error as error, w.priority as priority, w.work_type_id
as work_type_id, w.upload_id as upload_id, w.job_id as job_id,
w.batch_command_id as batch_command_id, w.mail_id as mail_id,
w.print_queue_id as print_queue_id from work_queue_available inner join
work_queue_data w on (work_queue_available.id = w.id) left outer join
batch_commands on (batch_command_id = batch_commands.id) left outer join
mail on (mail_id = mail.id) left outer join print_queue on (print_queue_id =
print_queue.id) left outer join uploads on (w.upload_id = uploads.upload_id)
left outer join jobs on (w.job_id = jobs.job_id) order by w.priority desc,
case when w.work_type_id = (select id from work_types where name = E'cron')
then 1 when w.work_type_id = (select id from work_types where name =
E'email') then 2 else 3 end, coalesce(batch_commands.created_at,
mail.created_at, print_queue.created_at, uploads.log_time, jobs.start_time),
id limit 28

locktype=relation relation=jobs
granted=true transaction=7/1931437 pid=27584 mode=AccessShareLock
query = <IDLE> in transaction

locktype=relation relation=jobs
granted=true transaction=8/2113080 pid=31961 mode=AccessShareLock
query = <IDLE> in transaction

----------

Lock:

locktype=relation relation=jobs
granted=false transaction=15/1241550 pid=32466 mode=AccessShareLock
query = select w.id as id, w.created_at as created_at, w.status as status,
w.timer as timer, w.error as error, w.priority as priority, w.work_type_id
as work_type_id, w.upload_id as upload_id, w.job_id as job_id,
w.batch_command_id as batch_command_id, w.mail_id as mail_id,
w.print_queue_id as print_queue_id from work_queue_available inner join
work_queue_data w on (work_queue_available.id = w.id) left outer join
batch_commands on (batch_command_id = batch_commands.id) left outer join
mail on (mail_id = mail.id) left outer join print_queue on (print_queue_id =
print_queue.id) left outer join uploads on (w.upload_id = uploads.upload_id)
left outer join jobs on (w.job_id = jobs.job_id) order by w.priority desc,
case when w.work_type_id = (select id from work_types where name = E'cron')
then 1 when w.work_type_id = (select id from work_types where name =
E'email') then 2 else 3 end, coalesce(batch_commands.created_at,
mail.created_at, print_queue.created_at, uploads.log_time, jobs.start_time),
id limit 28

Dependants of that lock:

locktype=relation relation=jobs
granted=false transaction=4/900 pid=10247 mode=AccessExclusiveLock
query = vacuum full verbose analyze;

locktype=relation relation=jobs
granted=false transaction=22/654227 pid=3321 mode=AccessShareLock
query = select id, to_char(time_stamp, 'YYYY-MM-DD HH24:MI:SS') as
time_stamp, injectd_log.log_pid as log_pid, injectd_host, injectd_pid, line
from injectd_log where log_pid in (select jobs.pid from jobs inner join
uploads using (upload_id) where account_id = E'1000-000' union (select pid
from uploads where account_id = E'1000-000')) and id > 27398675 order by id

locktype=relation relation=jobs
granted=true transaction=7/1931437 pid=27584 mode=AccessShareLock
query = <IDLE> in transaction

locktype=relation relation=jobs
granted=true transaction=8/2113080 pid=31961 mode=AccessShareLock
query = <IDLE> in transaction

To summarize the above output: there are three statements, each waiting on
the other two:

transaction 4/900: VACUUM FULL VERBOSE ANALYZE
transaction 22/654277: SELECT ... FROM INJECTD_LOG ...
transaction 15/1241550: SELECT ... FROM WORK_QUEUE_AVAILABLE ...

A "VACUUM VERBOSE ANALYZE" (without the "FULL") does not manifest this
anomaly.  Furthermore, after doing a partial "VACUUM VERBOSE ANALYZE" (that
is, aborting it after it has gotten finished with some tables, but before it
has done all of them), a "VACUUM FULL VERBOSE ANALYZE" gets farther (though
more tables) before the lockup occurs.  Specifically, the "VACUUM FULL
VERBOSE ANALYZE" does not get locked up until it hits a table that the
"VACUUM VERBOSE ANALYZE" did not get to.

Sadly, I have only made this happen in our production environment (very
active, very large set of data).  No tidy test case wrapped up in a tarball
with ribbons and bows.  I am submitting this bug report despite the clear
instructions to only submit bug reports against the latest minor version,
since I'm currently at the end of my rope (until I manage to reproduce this
in a non-production environment).  I am hoping I can be forgiven for being
one minor version behind.  If not, let me know--I've got a nice corner right
here to go stand in.

Re: BUG #4730: Vacuum full verbose analyze "deadlock"

От
"Kevin Grittner"
Дата:
>>> "Wayne Conrad" <wconrad@yagni.com> wrote:

> "VACUUM FULL ANALYZE VERBOSE" on a "deadlocks"

> "VACUUM VERBOSE ANALYZE" (without the "FULL") does not

You do realize that FULL should not be part of normal maintenance,
right?  It is sometimes useful to recover from table bloat when normal
maintenance fails.  Although it is almost always much slower than
CLUSTER, it has the advantage of not requiring disk space for a second
copy of the table, but it requires a REINDEX afterward to correct the
index bloat it causes.  If you are doing a good job of normal
maintenance, you never, ever should be running VACUUM FULL.

None of the above means you haven't found a problem worth looking at
-- I'm not trying to comment on that; but unless you are in the middle
of recovery from abnormal bloat, you may be able to dodge the problem
by correcting your maintenance practices.

-Kevin

Re: BUG #4730: Vacuum full verbose analyze "deadlock"

От
Tom Lane
Дата:
"Wayne Conrad" <wconrad@yagni.com> writes:
> "VACUUM FULL ANALYZE VERBOSE" on a "deadlocks" with other SQL statements.

What this looks like to me is an application problem.  In each case
you show, the only backends that actually *have* the lock are
"<IDLE> in transaction", ie they are waiting for their clients to
issue another command or close the transaction.  The other queries
won't be able to make any forward progress until those transactions
complete and release their locks.

> To summarize the above output: there are three statements, each waiting on
> the other two:

No, you're reading it wrong.  The sessions with granted=false are waiting
on the ones with granted=true.

It's conceivable that this actually is a deadlock situation, if the
connected clients are interrelated and are coded so that those other
queries must finish before they will do anything more.  But you'll have
to break such a deadlock on the application side; Postgres has no way to
know what's happening.

            regards, tom lane

Re: BUG #4730: Vacuum full verbose analyze "deadlock"

От
Wayne Conrad
Дата:
On Wed, Mar 25, 2009 at 09:15:48AM -0500, Kevin Grittner wrote:
> >>> "Wayne Conrad" <wconrad@yagni.com> wrote:
>
> > "VACUUM FULL ANALYZE VERBOSE" on a "deadlocks"
>
> > "VACUUM VERBOSE ANALYZE" (without the "FULL") does not
>
> You do realize that FULL should not be part of normal maintenance,
> right?  It is sometimes useful to recover from table bloat when normal
> maintenance fails.  Although it is almost always much slower than
> CLUSTER, it has the advantage of not requiring disk space for a second
> copy of the table, but it requires a REINDEX afterward to correct the
> index bloat it causes.  If you are doing a good job of normal
> maintenance, you never, ever should be running VACUUM FULL.
>
> None of the above means you haven't found a problem worth looking at
> -- I'm not trying to comment on that; but unless you are in the middle
> of recovery from abnormal bloat, you may be able to dodge the problem
> by correcting your maintenance practices.
>
> -Kevin

Kevin,

We started doing a routine VACUUM FULL ANALYZE on the advice of a
friend after the database started getting slow over time.  We did not
realize that the FULL should never be done.  We will change that to a
VACUUM ANALYZE, without the full, and see how it goes.

We discovered that an occasionally REINDEX was also required to keep
things fast.  We did not realize it was a direct result of the VACUUM
FULL that the REINDEX was required.

Thank you for your advice.  We are, obviously, not DB guys--just
programmers shoved into the role.  We push the buttons and pull the
levers until things appear to work, which is a poor substitute for
actually understanding what we are doing.

Best Regards,
  Wayne Conrad

Re: BUG #4730: Vacuum full verbose analyze "deadlock"

От
Wayne Conrad
Дата:
On Wed, Mar 25, 2009 at 11:28:44AM -0400, Tom Lane wrote:
> What this looks like to me is an application problem.  In each case
> you show, the only backends that actually *have* the lock are
> "<IDLE> in transaction", ie they are waiting for their clients to
> issue another command or close the transaction.  The other queries
> won't be able to make any forward progress until those transactions
> complete and release their locks.

Tom,

Thank you.  It's obvious now what a dunderhead I've been in misreading
the output of our own diagnostic tool.  Thank you for the education.
We're going to sidestep the issue, for now, by going with Kevin's
suggestion to ditch the "FULL".  If that doesn't work (such that I've
got to put the FULL back in and reintroduce the lockup), I'll get back
to my application, understand what's got it in knots, and untie it.

I apologize for blaming Postgres.

Best Regards,
        Wayne Conrad

Re: BUG #4730: Vacuum full verbose analyze "deadlock"

От
Alvaro Herrera
Дата:
Wayne Conrad wrote:

> We started doing a routine VACUUM FULL ANALYZE on the advice of a
> friend after the database started getting slow over time.  We did not
> realize that the FULL should never be done.  We will change that to a
> VACUUM ANALYZE, without the full, and see how it goes.
>
> We discovered that an occasionally REINDEX was also required to keep
> things fast.  We did not realize it was a direct result of the VACUUM
> FULL that the REINDEX was required.

You probably need to increase the size of the free space map too.  See
the fsm_max_pages parameter in postgresql.conf.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: BUG #4730: Vacuum full verbose analyze "deadlock"

От
"Kevin Grittner"
Дата:
Wayne Conrad <wconrad@yagni.com> wrote:
> the database started getting slow over time.

As Alvaro pointed out, this can happen if your fsm configuration
doesn't allow enough space for a normal VACUUM to keep track of all
the free space.  Since you're running VACUUM with the VERBOSE option,
be sure to capture the output and review the last few lines; this will
tell you what setting you have and the minimum which would cover
current needs.

Also, be sure you haven't disabled autovacuum along the way.

There are some usage patterns which require special forms of
maintenance, but VACUUM FULL usually isn't the best option for routine
maintenance.  If you notice performance degrading again, please post
details on the performance list.

I hope this helps.

-Kevin