Обсуждение: Timing out A Blocker Based on Time or Count of Waiters

Поиск
Список
Период
Сортировка

Timing out A Blocker Based on Time or Count of Waiters

От
Fred Habash
Дата:
Facing an issue where sometimes humans login to a database and run DDL statements causing a long locking tree of over 1000 waiters. As a workaround, we asked developers to always start their DDL sessions with 'SET lock_timeout = 'Xs'. 

I reviewed the native lock timeout parameter in Postgres and found 7. None seem to be related to blocker timeouts directly.

idle_in_transaction_session_timeout
idle_session_timeout
lock_timeout: How long a session waits for a lock
statement_timeout
authentication_timeout
deadlock_timeout
log_lock_waits

Instead, I put together a quick procedure that counts waiter sessions for a given blocker and terminates it if waiter count exceeds a threshold.

Is there not a native way to ...
1. Automatically time out a blocker
2. A metric that shows how many waiters for a blocker?

Thanks
--

----------------------------------------
Thank you


Re: Timing out A Blocker Based on Time or Count of Waiters

От
Adrian Klaver
Дата:
On 3/22/24 09:25, Fred Habash wrote:
> Facing an issue where sometimes humans login to a database and run DDL 
> statements causing a long locking tree of over 1000 waiters. As a 

The above needs more explanation:

1) Define locking tree.

2) Define waiters.

3) Provide examples of the DDL.


> workaround, we asked developers to always start their DDL sessions 
> with 'SET lock_timeout = 'Xs'.
> 
> I reviewed the native lock timeout parameter in Postgres and found 7. 
> None seem to be related to blocker timeouts directly.
> 
> idle_in_transaction_session_timeout
> idle_session_timeout
> lock_timeout: How long a session waits for a lock
> statement_timeout
> authentication_timeout
> deadlock_timeout
> log_lock_waits
> 
> Instead, I put together a quick procedure that counts waiter sessions 
> for a given blocker and terminates it if waiter count exceeds a threshold.
> 
> Is there not a native way to ...
> 1. Automatically time out a blocker
> 2. A metric that shows how many waiters for a blocker?
> 
> Thanks
> -- 
> 
> ----------------------------------------
> Thank you
> 
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Timing out A Blocker Based on Time or Count of Waiters

От
Christophe Pettus
Дата:

> On Mar 22, 2024, at 09:25, Fred Habash <fmhabash@gmail.com> wrote:
>
> Facing an issue where sometimes humans login to a database and run DDL statements causing a long locking tree of over
1000waiters. As a workaround, we asked developers to always start their DDL sessions with 'SET lock_timeout = 'Xs'.  
>
> I reviewed the native lock timeout parameter in Postgres and found 7. None seem to be related to blocker timeouts
directly.

"Blocker" isn't totally clear to me, but assuming you mean, "Is there a way of terminating a transaction that is
holdinga lock on which other sessions are waiting after a certain amount of time, even if the session is actively
runningqueries?", the answer is no, there's no specific setting in PostgreSQL that does that automatically. 

The most apropos setting is `idle_in_transaction_session_timeout`, but that will not terminate a session that is
activelyrunning a query.  A combination of `idle_in_transaction_session_timeout` and `statement_timeout` will get you
veryclose to it, however.  That won't catch a session that is running queries that are less take less than
`statement_timeout`to complete, and don't wait more than `idle_in_transaction_session_timeout` to issue a new query.
Itdoes not also discriminate between transactions that are holding locks on which other sessions are waiting, and ones
thataren't. 

You could write a polling script that checks pg_stat_activity and pg_locks, and terminates transactions that have been
runninglonger than x seconds, and which are holding locks that other sessions are waiting on.  I'm not sure that's
reallya recommended course of action, though, as scripts like that can often kill things you didn't really mean them
to.


Re: Timing out A Blocker Based on Time or Count of Waiters

От
Fred Habash
Дата:
Lock tree: All PID's waiting on a lock held by/blocked by single blocker PID. Similar to what you see in the output of this script: https://github.com/dataegret/pg-utils/blob/master/sql/locktree.sql . It uses the dot connotation to draw a tree. 

Waiters: The PID (first column) returned by this query, for example 

SELECT
    activity.pid,
    activity.usename,
    activity.query,
    blocking.pid AS blocking_id,
    blocking.query AS blocking_query
FROM pg_stat_activity AS activity
JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(activity.pid));

DDL example: An 'alter table ... alter column ...' would cause all DML and SELECT statements to wait/block. 

Hope this answers your question. Thanks for your interest. 

On Fri, Mar 22, 2024 at 12:32 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 3/22/24 09:25, Fred Habash wrote:
> Facing an issue where sometimes humans login to a database and run DDL
> statements causing a long locking tree of over 1000 waiters. As a

The above needs more explanation:

1) Define locking tree.

2) Define waiters.

3) Provide examples of the DDL.


> workaround, we asked developers to always start their DDL sessions
> with 'SET lock_timeout = 'Xs'.
>
> I reviewed the native lock timeout parameter in Postgres and found 7.
> None seem to be related to blocker timeouts directly.
>
> idle_in_transaction_session_timeout
> idle_session_timeout
> lock_timeout: How long a session waits for a lock
> statement_timeout
> authentication_timeout
> deadlock_timeout
> log_lock_waits
>
> Instead, I put together a quick procedure that counts waiter sessions
> for a given blocker and terminates it if waiter count exceeds a threshold.
>
> Is there not a native way to ...
> 1. Automatically time out a blocker
> 2. A metric that shows how many waiters for a blocker?
>
> Thanks
> --
>
> ----------------------------------------
> Thank you
>
>

--
Adrian Klaver
adrian.klaver@aklaver.com



--

----------------------------------------
Thank you


Re: Timing out A Blocker Based on Time or Count of Waiters

От
Adrian Klaver
Дата:
On 3/22/24 12:41, Fred Habash wrote:
> Lock tree: All PID's waiting on a lock held by/blocked by single blocker 
> PID. Similar to what you see in the output of this script: 
> https://github.com/dataegret/pg-utils/blob/master/sql/locktree.sql 
> <https://github.com/dataegret/pg-utils/blob/master/sql/locktree.sql> . 
> It uses the dot connotation to draw a tree.
> 
> Waiters: The PID (first column) returned by this query, for example
> 
> SELECT
>      activity.pid,
>      activity.usename,
>      activity.query,
>      blocking.pid AS blocking_id,
>      blocking.query AS blocking_query
> FROM pg_stat_activity AS activity
> JOIN pg_stat_activity AS blocking ON blocking.pid = 
> ANY(pg_blocking_pids(activity.pid));
> 
> DDL example: An 'alter table ... alter column ...' would cause all DML 
> and SELECT statements to wait/block.
> 
> Hope this answers your question. Thanks for your interest.

Yes.

To me the issue is,

"Facing an issue where sometimes humans login to a database and run DDL 
statements causing a long locking tree of over 1000 waiters."

where the problem is people running disruptive statements without regard 
to planning or what else is happening on the database. I am not sure 
that dropping a statement just based on a count is progress. If the DDL 
is important then it needs to be run at some point and you are 
conceivably back at the same blocking issue. This then leads to two 
possibilities either the DDL is not important and shouldn't be run or it 
is and some thought and timing needs to be applied before it is run.

> 
> On Fri, Mar 22, 2024 at 12:32 PM Adrian Klaver 
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
> 
>     On 3/22/24 09:25, Fred Habash wrote:
>      > Facing an issue where sometimes humans login to a database and
>     run DDL
>      > statements causing a long locking tree of over 1000 waiters. As a
> 
>     The above needs more explanation:
> 
>     1) Define locking tree.
> 
>     2) Define waiters.
> 
>     3) Provide examples of the DDL.
> 
> 
>      > workaround, we asked developers to always start their DDL sessions
>      > with 'SET lock_timeout = 'Xs'.
>      >
>      > I reviewed the native lock timeout parameter in Postgres and
>     found 7.
>      > None seem to be related to blocker timeouts directly.
>      >
>      > idle_in_transaction_session_timeout
>      > idle_session_timeout
>      > lock_timeout: How long a session waits for a lock
>      > statement_timeout
>      > authentication_timeout
>      > deadlock_timeout
>      > log_lock_waits
>      >
>      > Instead, I put together a quick procedure that counts waiter
>     sessions
>      > for a given blocker and terminates it if waiter count exceeds a
>     threshold.
>      >
>      > Is there not a native way to ...
>      > 1. Automatically time out a blocker
>      > 2. A metric that shows how many waiters for a blocker?
>      >
>      > Thanks
>      > --
>      >
>      > ----------------------------------------
>      > Thank you
>      >
>      >
> 
>     -- 
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
> 
> 
> 
> -- 
> 
> ----------------------------------------
> Thank you
> 
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Timing out A Blocker Based on Time or Count of Waiters

От
Merlin Moncure
Дата:
On Fri, Mar 22, 2024 at 11:25 AM Fred Habash <fmhabash@gmail.com> wrote:
Facing an issue where sometimes humans login to a database and run DDL statements causing a long locking tree of over 1000 waiters. As a workaround, we asked developers to always start their DDL sessions with 'SET lock_timeout = 'Xs'. 

I reviewed the native lock timeout parameter in Postgres and found 7. None seem to be related to blocker timeouts directly.

idle_in_transaction_session_timeout
idle_session_timeout
lock_timeout: How long a session waits for a lock
statement_timeout
authentication_timeout
deadlock_timeout
log_lock_waits

Instead, I put together a quick procedure that counts waiter sessions for a given blocker and terminates it if waiter count exceeds a threshold.

Is there not a native way to ...
1. Automatically time out a blocker
2. A metric that shows how many waiters for a blocker?


I guess this probably does not belong in the native codebase because in most real world scenarios with contention you would end up with priority inversion or a situation where no work gets done.  With current locking rules, theoretically the work queue would always clear (assuming the locker doesn't hold the transaction indefinitely), where with your setting enabled it might not always assume the locker retries.

In your case, a hand written 'unblocker' script might be the way to go, or (probably better) encourage patterns where critical tables are not blocked, say by building up a scratch table and swapping in on a separate transaction.   Reducing contention rather than mitigating the symptoms of it, is *always* a good thing.

merlin