Обсуждение: [GENERAL] Logical decoding client has the power to crash the server

Поиск
Список
Период
Сортировка

[GENERAL] Logical decoding client has the power to crash the server

От
Meel Velliste
Дата:
From what I understand about logical decoding, there is no limit to how many log entries will be retained by the server if nobody reads them from the logical slot. This means that a client that fails to read from the slot has the power to bring down the master database because the server's disk will get full at which point all subsequent write operations will fail and even read operations will fail because they too need temporary space. Even the underlying operating system may be affected as it too may need temporary disk space to carry out its basic functions.

This essentially means that the client has the power to bring down the server without even doing anything malicious, merely by being passive. Seems like it should not be this way.

Is there really no way to limit the retained log entries, for example by specifying the amount of time (e.g. 24 hours or 7 days), so that any entries older than that would be automatically dropped?

Re: [GENERAL] Logical decoding client has the power to crash the server

От
Michael Paquier
Дата:
On Wed, Sep 20, 2017 at 3:14 PM, Meel Velliste <meel@fivetran.com> wrote:
> From what I understand about logical decoding, there is no limit to how many
> log entries will be retained by the server if nobody reads them from the
> logical slot. This means that a client that fails to read from the slot has
> the power to bring down the master database because the server's disk will
> get full at which point all subsequent write operations will fail and even
> read operations will fail because they too need temporary space. Even the
> underlying operating system may be affected as it too may need temporary
> disk space to carry out its basic functions.

Monitoring is a mandatory part of the handling of replication slots.
One possible solution is to use a background worker that scans slots
causing bloat in pg_xlog and to automatically get rid of them so as
the primary is preserved from any crash. Note that advancing a slot is
doable for a physical slot, but advancing a logical slot is trickier
(not sure if that's doable actually but Andres can comment on that)
because it involves being sure that the catalog_xmin is still
preserved so as past logical changes can be looked at consistently.
-- 
Michael


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Logical decoding client has the power to crash the server

От
Meel Velliste
Дата:
Hi Michael,

Thank you, I appreciate your response. Now that you mention, I am realizing that I don't really care about dropping the oldest log entries. Mandatory monitoring makes a lot of sense and dropping the entire slot would be perfect when it consumes too much space.

The only problem with monitoring is that I may have no control over it. My use case is complicated by the fact that there are three parties:
1) Our customer who has admin privileges on the database
2) Us with limited privileges
3) The database hosting provider who restricts access to the underlying OS and file system

In this situation, neither us, nor our customer has the power to install the required monitoring of pg_xlog. The database hosting provider would have to do it. In most cases (e.g. Amazon RDS) the hosting provider does provide a way of monitoring overall disk usage, which may be good enough. But I am thinking it would make sense for postgres to have default, built-in monitoring that drops all the slots when pg_xlog gets too full (based on some configurable limit). Otherwise everybody has to build their own monitoring and I imagine 99% of them would want the same behavior. Nobody wants their database to fail just because some client was not reading the slot. 

In our case, if we lose access to the customer's database, if they did not install monitoring (even though we told them to), their disk will fill up and they will blame us for crashing their database. It ends up being a classic case of finger pointing between multiple parties. This has not happened yet but I am sure it is just a matter of time. I would really like to see a default, built-in circuit breaker in postgres to prevent this.

Another bit of context here is that the logical decoding is of secondary importance to our customers, but their postgres database itself is absolutely mission critical.

Thanks,

Meel

On Wed, Sep 20, 2017 at 12:43 AM Michael Paquier <michael.paquier@gmail.com> wrote:
On Wed, Sep 20, 2017 at 3:14 PM, Meel Velliste <meel@fivetran.com> wrote:
> From what I understand about logical decoding, there is no limit to how many
> log entries will be retained by the server if nobody reads them from the
> logical slot. This means that a client that fails to read from the slot has
> the power to bring down the master database because the server's disk will
> get full at which point all subsequent write operations will fail and even
> read operations will fail because they too need temporary space. Even the
> underlying operating system may be affected as it too may need temporary
> disk space to carry out its basic functions.

Monitoring is a mandatory part of the handling of replication slots.
One possible solution is to use a background worker that scans slots
causing bloat in pg_xlog and to automatically get rid of them so as
the primary is preserved from any crash. Note that advancing a slot is
doable for a physical slot, but advancing a logical slot is trickier
(not sure if that's doable actually but Andres can comment on that)
because it involves being sure that the catalog_xmin is still
preserved so as past logical changes can be looked at consistently.
--
Michael

Re: [GENERAL] Logical decoding client has the power to crash the server

От
Michael Paquier
Дата:
On Thu, Sep 21, 2017 at 1:09 PM, Meel Velliste <meel@fivetran.com> wrote:
> In this situation, neither us, nor our customer has the power to install the
> required monitoring of pg_xlog. The database hosting provider would have to
> do it. In most cases (e.g. Amazon RDS) the hosting provider does provide a
> way of monitoring overall disk usage, which may be good enough. But I am
> thinking it would make sense for postgres to have default, built-in
> monitoring that drops all the slots when pg_xlog gets too full (based on
> some configurable limit). Otherwise everybody has to build their own
> monitoring and I imagine 99% of them would want the same behavior. Nobody
> wants their database to fail just because some client was not reading the
> slot.

(Please avoid top-posting, this breaks the logic of the thread and
this is contrary to the practices of the Postgres mailing lists)

Note that on-disk lookup is not strictly necessary. If you know
max_wal_size, pg_current_wal_lsn (or pg_last_wal_receive_lsn if
working on a standby) and the restart_lsn of the slots that's enough.
If you don't have privileges sufficient to see that, well I guess that
you will need to review the access permissions to your instance.
Postgres 9.6 offers better access control to system functions, so you
could be granted access to just those resources to be fine using a SQL
session.
-- 
Michael


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Logical decoding client has the power to crash theserver

От
Igor Neyman
Дата:
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Michael Paquier
Sent: Thursday, September 21, 2017 12:33 AM
To: Meel Velliste <meel@fivetran.com>
Cc: PostgreSQL mailing lists <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] Logical decoding client has the power to crash the server

On Thu, Sep 21, 2017 at 1:09 PM, Meel Velliste <meel@fivetran.com> wrote:
> In this situation, neither us, nor our customer has the power to 
> install the required monitoring of pg_xlog. The database hosting 
> provider would have to do it. In most cases (e.g. Amazon RDS) the 
> hosting provider does provide a way of monitoring overall disk usage, 
> which may be good enough. But I am thinking it would make sense for 
> postgres to have default, built-in monitoring that drops all the slots 
> when pg_xlog gets too full (based on some configurable limit). 
> Otherwise everybody has to build their own monitoring and I imagine 
> 99% of them would want the same behavior. Nobody wants their database 
> to fail just because some client was not reading the slot.

(Please avoid top-posting, this breaks the logic of the thread and this is contrary to the practices of the Postgres
mailinglists)
 

Note that on-disk lookup is not strictly necessary. If you know max_wal_size, pg_current_wal_lsn (or
pg_last_wal_receive_lsnif working on a standby) and the restart_lsn of the slots that's enough.
 
If you don't have privileges sufficient to see that, well I guess that you will need to review the access permissions
toyour instance.
 
Postgres 9.6 offers better access control to system functions, so you could be granted access to just those resources
tobe fine using a SQL session.
 
--
Michael

______________________________________________________________________________________________________

I think the difference between pg_current_wal_lsn() and confirmed_flush_lsn form pg_catalog.pg_replication_slots for
specificreplication slot:
 

SELECT (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance             FROM pg_catalog.pg_replication_slots
 WHERE slot_name = '<some_subscibtion_name>';
 
provides a measure in Logical Replication environment of how far did (or did not) Subscriber fell behind Publisher, and
hencesome kind of measure of how much "extra" WALs is stored on the Publisher.
 

Regards,
Igor Neyman


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Logical decoding client has the power to crash the server

От
Michael Paquier
Дата:
On Fri, Sep 22, 2017 at 5:44 AM, Igor Neyman <ineyman@perceptron.com> wrote:
> I think the difference between pg_current_wal_lsn() and confirmed_flush_lsn form pg_catalog.pg_replication_slots for
specificreplication slot:
 
>
> SELECT (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
>        FROM pg_catalog.pg_replication_slots
>        WHERE slot_name = '<some_subscibtion_name>';
>
>  provides a measure in Logical Replication environment of how far did (or did not) Subscriber fell behind Publisher,
andhence some kind of measure of how much "extra" WALs is stored on the Publisher.
 

More or less. I also make use of the value of max_wal_size in
pg_settings to do the decision-making, value set depending on a
fraction of the size of the partition dedicated to pg_xlog. By
applying pg_size_bytes() which is new to 9.6 on top of the value
fetched from pg_settings you can make a direct comparison and decide
if a slot can be dropped or not. Make sure that things are casted to
bigint though.
-- 
Michael


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general