The dangers of long running open transactions

Поиск
Список
Период
Сортировка
От Wayne Schroeder
Тема The dangers of long running open transactions
Дата
Msg-id 44355FEA.1060702@positivenetworks.net
обсуждение исходный текст
Список pgsql-general
Summary: Long running transaction from an orphaned database connection
caused major slowness on very active and frequently vacuumed tables
because vacuum could not remove rows that were newer than the long
running transaction.

A while ago, I asked for opinions on the effects of long running
transactions on the ability of the system to effectivly vacuum tables.
I got a response from Tom Lane saying that having transactions open for
long periods of time could indeed cause issues with vacuuming.

On one of our database servers, we ended up having this happen -- mind
you that we have monitoring to tell us if we get close to issues in
regards to the actual frozenxids getting too old to ensure vacuuming
happens and do have regular (in fact, multiple times a day) vacuums.
Our problem came from a very active table becoming quite slow to deal
with.  Explain analyze select count(*) on this table showed us that even
though there were less than a thousand rows in this table, the cost in
rows was much higher for the sequential scan.  I don't remember the
exact number, but select count(*) was taking around 20 seconds, and
inserts / deletes were happening to this table on a regular basis.  The
number of rows analyze was exposed to would NOT go down after vacuums of
 this table.

Due to the question I asked the list earlier, I determined that it was
best to kill all the 'idle in transaction' connections (which we thought
were not holding any transactions for more than 60 seconds and cycling
to new xids since we were aware that long running transactions were bad)
and try vacuuming again.  This did the trick, the table was successfully
vacuumed and was 'instant' fast once again.

A review of system monitor logs showed me that one of the 'idle in
transaction' connections did NOT come back (our software will handle
database disconnections and reconnect).  It turns out that a 'worker
thread' in a process on a remote machine had either become stuck / died
during an open transaction and the database connection was leaked since
the process itself did not terminate.  This left a long running
transaction that was never committed / rolled back for a long period of
time, up until it caused massive slowness when vacuums were proving to
be worthless on very active tables.

We have started to monitor for potentially stuck / long running
transactions with the following query:

select pid, transaction, age(transaction) from pg_locks where
transaction is not null order by age(transaction) desc;

This can be changed around obviously for whatever monitoring software
you use / write, but the basic idea is that the maximum age column tells
you 'how long ago in xids' vacuum must preserve data when it vacuums
tables due to a given connection process.  If this number gets big, in
our case, we warn if they get over 20k (we have lots of transactions) so
we can track down and prevent any issues in advance, action should be
taken to prevent database performance problems / at worst, xid roll
around problems if the problem went on for too long.

Wayne

В списке pgsql-general по дате отправления:

Предыдущее
От: Bill Moseley
Дата:
Сообщение: plgpsql and transactions
Следующее
От: MargaretGillon@chromalloy.com
Дата:
Сообщение: Re: Unique constraint or index, case insensitive, on multiple