Обсуждение: Query cancellation on hot standby because of buffer pins

Поиск
Список
Период
Сортировка

Query cancellation on hot standby because of buffer pins

От
Drazen Kacar
Дата:
I'm seeing query cancellation on hot standby (Postgres 9.3.4, configured with hot_standby_feedback=on) because of buffer pins being held for too long.

Database message looks like this:

ERROR:  canceling statement due to conflict with recovery
DETAIL:  User was holding shared buffer pin for too long.
STATEMENT: ...

I don't understand what kind of activity (either on primary or on standby) can cause this. The documentation about cancellation on standby lists access exclusive locks, vacuum and drops of tablespace or database on the primary as possible causes for cancellation on the standby.

I'm sure that neither of those things was happening on the primary when the query was cancelled on standby. I also don't know what "too long" means? That is, what's the timeout? I've looked at the source a bit and it seems that the timeout is the value of deadlock_timeout parametar, but I'm not very sure about that.

So can somebody explain why the query needs to be cancelled if it pins the buffer for too long and what exactly "too long" is?

The only thing I managed to find on the net is:
http://t61224.db-postgresql-bugs.dbtalk.info/bug-7546-backups-on-hot-standby-cancelled-despite-hot-standby-on-t61224.html

But that doesn't explain why this kind of cancellation happens.

Re: Query cancellation on hot standby because of buffer pins

От
Kevin Grittner
Дата:
Drazen Kacar <drazen.kacar@oradian.com> wrote:

> I'm seeing query cancellation on hot standby (Postgres 9.3.4,
> configured with hot_standby_feedback=on) because of buffer pins
> being held for too long.
>
> Database message looks like this:
>
> ERROR: canceling statement due to conflict with recovery
> DETAIL: User was holding shared buffer pin for too long.
> STATEMENT: ...

It would be helpful to see what that statement was that you
replaced with the ellipsis.

I you don't already have log_autovacuum_min_duration set to 0 (or
something very close to it) on the primary, I recommend you set
that; it would be useful to see if this corresponds to a vacuum on
the table.

What is your maintenance_work_mem setting on the primary? Does
increasing it change the behavior? (You may want to jack that up
really high for diagnostic purposes.)

If you can associate the problem with a particular table, seeing
the definition of that table wouldn't hurt, either.


Oh, and please update to 9.3.6; there are some nasty bugs fixed.


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Query cancellation on hot standby because of buffer pins

От
Drazen Kacar
Дата:


On 20 February 2015 at 23:48, Kevin Grittner <kgrittn@ymail.com> wrote:
Drazen Kacar <drazen.kacar@oradian.com> wrote:

> I'm seeing query cancellation on hot standby (Postgres 9.3.4,
> configured with hot_standby_feedback=on) because of buffer pins
> being held for too long.
>
> Database message looks like this:
>
> ERROR: canceling statement due to conflict with recovery
> DETAIL: User was holding shared buffer pin for too long.
> STATEMENT: ...

It would be helpful to see what that statement was that you
replaced with the ellipsis.

It's largish:

              SELECT
                products.name                                         AS product,
                null                                                  AS branch_name,
                null                                                  AS credit_officer_name,
                COUNT(DISTINCT account_id)                            AS accounts,
                COUNT(DISTINCT client_id)                             AS clients,
                SUM(balance)                                          AS balance,
                ROUND((SUM(balance) / COUNT(DISTINCT account_id)), 2) AS averageBalance
              FROM deposit_accounts
                JOIN accounts ON deposit_accounts.account_id = accounts.id
                JOIN clients ON accounts.client_id = clients.id
                JOIN products ON accounts.product_id = products.id
                JOIN organisation_structure_nodes ON (clients.organisation_structure_node_id = organisation_structure_nodes.id)
              WHERE organisation_structure_nodes.tree_path <@ (SELECT tree_path FROM organisation_structure_nodes WHERE organisation_structure_nodes.id = $1)
                    AND organisation_structure_nodes.tree_path <@ (SELECT tree_path FROM user_top_visible_node_tree_paths_view WHERE user_id = $2)
                    AND accounts.status_id = 3 OR accounts.status_id = 4
              GROUP BY products.id
              ORDER BY product

 

I you don't already have log_autovacuum_min_duration set to 0 (or
something very close to it) on the primary, I recommend you set
that; it would be useful to see if this corresponds to a vacuum on
the table.

It is already at zero. There was no vacuuming on the primary around that time. I've seen some other cancellations which were probably due to the vacuuming on the primary (even though hot_standby_feedback is set to on), but then the error on the standby looked like:

DETAIL: User was holding a relation lock for too long.

I'm not sure why that happened either, but I suppose it has a different cause. At the time they happened on the standby there was vacuuming of one table participating in the select query on the primary.


What is your maintenance_work_mem setting on the primary? Does
increasing it change the behavior? (You may want to jack that up
really high for diagnostic purposes.)

It's set to 400MB on the primary. These cancellations happened on the production system and they are very rare (maybe once or twice a week) and I don't have a way to reproduce them. So I don't think I have a way to correlate changing of the configuration parameters to query cancellation.

BTW, above mentioned cancellation because of the relation lock and vacuuming on the primary also happens very rarely and I don't know how to reproduce that either.


If you can associate the problem with a particular table, seeing
the definition of that table wouldn't hurt, either.

I don't think I can associate the problem with one table in the above query.
 


Oh, and please update to 9.3.6; there are some nasty bugs fixed.


It's scheduled for today. :-)
 

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Query cancellation on hot standby because of buffer pins

От
Simon Riggs
Дата:
On 23 February 2015 at 08:19, Drazen Kacar <drazen.kacar@oradian.com> wrote:

> At the time they happened on the standby there was vacuuming of one
> table participating in the select query on the primary.

The VACUUM will have generated a WAL record that needs super exclusive
access to the block. Since feedback was enabled that record would not
have removed data visible by the query, but still needs to edit the
block.

The query was pinning that block, so this situation led to a delay on
the standby, which then led to cancellation of the query.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services


Re: Query cancellation on hot standby because of buffer pins

От
Drazen Kacar
Дата:


On 23 February 2015 at 11:12, Simon Riggs <simon@2ndquadrant.com> wrote:
On 23 February 2015 at 08:19, Drazen Kacar <drazen.kacar@oradian.com> wrote:

> At the time they happened on the standby there was vacuuming of one
> table participating in the select query on the primary.

The VACUUM will have generated a WAL record that needs super exclusive
access to the block. Since feedback was enabled that record would not
have removed data visible by the query, but still needs to edit the
block.

The query was pinning that block, so this situation led to a delay on
the standby, which then led to cancellation of the query.

OK, that makes sense.

So if I increase max_standby_archive_delay (or set it to infinite) that shouldn't happen?

My problem with max_standby_archive_delay in this case is that I don't understand relative priorities between obtaining a lock by queries and obtaining a lock by vacuum (or other applications of WAL records).

If the first query obtains a lock that the vacuum needs and I have sufficiently large max_standby_archive delay, I suppose the vacuum will wait and the query won't be cancelled. What happens if another query that needs a lock on the same table comes in while vacuum is waiting? Is there a way to guarantee that the vacuum (which blocks application of subsequent WAL records, I assume) will be the first one to get the lock? If not, then (with max_standby_archive_delay=-1) it's possible that the application of WAL records could wait indefinitely.

OTOH, if max_standby_archive_delay is finite and there is no way to guarantee that vacuum gets its lock before other queries, then there's no way to avoid query cancellation.

Am I getting something wrong here?


Re: Query cancellation on hot standby because of buffer pins

От
Simon Riggs
Дата:
On 23 February 2015 at 10:35, Drazen Kacar <drazen.kacar@oradian.com> wrote:
>
>
> On 23 February 2015 at 11:12, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 23 February 2015 at 08:19, Drazen Kacar <drazen.kacar@oradian.com>
>> wrote:
>>
>> > At the time they happened on the standby there was vacuuming of one
>> > table participating in the select query on the primary.
>>
>> The VACUUM will have generated a WAL record that needs super exclusive
>> access to the block. Since feedback was enabled that record would not
>> have removed data visible by the query, but still needs to edit the
>> block.
>>
>> The query was pinning that block, so this situation led to a delay on
>> the standby, which then led to cancellation of the query.
>
>
> OK, that makes sense.
>
> So if I increase max_standby_archive_delay (or set it to infinite) that
> shouldn't happen?
>
> My problem with max_standby_archive_delay in this case is that I don't
> understand relative priorities between obtaining a lock by queries and
> obtaining a lock by vacuum (or other applications of WAL records).
>
> If the first query obtains a lock that the vacuum needs and I have
> sufficiently large max_standby_archive delay, I suppose the vacuum will wait
> and the query won't be cancelled. What happens if another query that needs a
> lock on the same table comes in while vacuum is waiting? Is there a way to
> guarantee that the vacuum (which blocks application of subsequent WAL
> records, I assume) will be the first one to get the lock? If not, then (with
> max_standby_archive_delay=-1) it's possible that the application of WAL
> records could wait indefinitely.
>
> OTOH, if max_standby_archive_delay is finite and there is no way to
> guarantee that vacuum gets its lock before other queries, then there's no
> way to avoid query cancellation.
>
> Am I getting something wrong here?

Yes, you are confusing block and relation level locks. The contention
is at block level.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services