Обсуждение: Make unlogged table resets detectable

Поиск
Список
Период
Сортировка

Make unlogged table resets detectable

От
Jeff Davis
Дата:
One problem with unlogged tables is that the application has no way to
tell if they were reset, or they just happen to be empty.

This can be a problem with sharding, where you might have different
shards of an unlogged table on different servers. If one server
crashes, you'll be missing only one shard of the data, which may appear
inconsistent. In that case, you'd like the application (or sharding
solution) to be able to detect that one shard was lost, and TRUNCATE
those that remain to get back to a reasonable state.

It would be easy enough for the init fork to have a single page with a
flag set. That way, when the main fork is replaced with the init fork,
other code could detect that a reset happened.

When detected, depending on a GUC, the behavior could be to auto-
truncate it (to get the current silent behavior), or refuse to perform
the operation (except an explicit TRUNCATE), or issue a
warning/log/notice.

The biggest challenge would be: when should we detect that the reset
has happened? There might be a lot of entry points. Another idea would
be to just have a SQL function that the application could call whenever
it needs to know.

Thoughts?

    Jeff Davis





Re: Make unlogged table resets detectable

От
Heikki Linnakangas
Дата:
On 03/06/2021 23:04, Jeff Davis wrote:
> One problem with unlogged tables is that the application has no way to
> tell if they were reset, or they just happen to be empty.
> 
> This can be a problem with sharding, where you might have different
> shards of an unlogged table on different servers. If one server
> crashes, you'll be missing only one shard of the data, which may appear
> inconsistent. In that case, you'd like the application (or sharding
> solution) to be able to detect that one shard was lost, and TRUNCATE
> those that remain to get back to a reasonable state.
> 
> It would be easy enough for the init fork to have a single page with a
> flag set. That way, when the main fork is replaced with the init fork,
> other code could detect that a reset happened.

I'd suggest using a counter rather than a flag. With a flag, if one 
client clears the flag to acknowledge that a truncation happened, others 
might miss it. See also ABA problem.

> When detected, depending on a GUC, the behavior could be to auto-
> truncate it (to get the current silent behavior), or refuse to perform
> the operation (except an explicit TRUNCATE), or issue a
> warning/log/notice.

TRUNCATE isn't quite what happens when an unlogged table is 
re-initialized. It changes the relfilenode, resets stats, and requires a 
more strict lock. So I don't think repurposing TRUNCATE for 
re-initializing a table is a good idea. There's also potential for a 
race condition, if two connections see that a table needs 
re-initialization, and issue "TRUNCATE + INSERT" concurrently. One of 
the INSERTs will be lost.

A warning or notice is easy to miss.

> The biggest challenge would be: when should we detect that the reset
> has happened? There might be a lot of entry points. Another idea would
> be to just have a SQL function that the application could call whenever
> it needs to know.

Yeah, a SQL function to get the current "reset counter" would be nice.

- Heikki



Re: Make unlogged table resets detectable

От
Jeff Davis
Дата:
On Fri, 2021-06-04 at 09:42 +0300, Heikki Linnakangas wrote:
> I'd suggest using a counter rather than a flag. With a flag, if one 
> client clears the flag to acknowledge that a truncation happened,
> others 
> might miss it. See also ABA problem.

This feels like it's getting more complex.

Stepping back, maybe unlogged tables are the wrong level to solve this
problem. We could just have a "crash counter" in pg_control that would
be incremented every time a crash happened (and all unlogged tables are
reset). It might be a number or maybe the LSN of the startup checkpoint
after the most recent crash.

A SQL function could read the value. Perhaps we'd also have a SQL
function to reset it, but I don't see a use case for it.

Then, it's up to the client to check it against a stored value, and
clear/repopulate unlogged tables as necessary.

Regards,
    Jeff Davis





Re: Make unlogged table resets detectable

От
Robert Haas
Дата:
On Fri, Jun 4, 2021 at 8:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Stepping back, maybe unlogged tables are the wrong level to solve this
> problem. We could just have a "crash counter" in pg_control that would
> be incremented every time a crash happened (and all unlogged tables are
> reset). It might be a number or maybe the LSN of the startup checkpoint
> after the most recent crash.
>
> A SQL function could read the value. Perhaps we'd also have a SQL
> function to reset it, but I don't see a use case for it.
>
> Then, it's up to the client to check it against a stored value, and
> clear/repopulate unlogged tables as necessary.

I think this would be useful for a variety of purposes. Both being
able to know the last time that it happened and being able to know the
number of times that it happened could be useful, depending on the
scenario. For example, if one of my employer's customers began
complaining about a problem that started happening recently, it would
be useful to be able to establish whether there had also been a crash
recently, and a timestamp or LSN would help a lot. On the other hand,
if we had a counter, we'd probably find out some interesting things,
too. Maybe someone would report that the value of the counter was
surprisingly large. For example, if a customer's pg_control output
showed that the database cluster had performed crash recovery 162438
times, I might have some, err, followup questions.

This is not a vote for or against any specific proposal; it's just a
general statement that I support trying to do something in this area,
and that it feels like anything we do will likely have some value.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make unlogged table resets detectable

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 4, 2021 at 8:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> Stepping back, maybe unlogged tables are the wrong level to solve this
>> problem. We could just have a "crash counter" in pg_control that would
>> be incremented every time a crash happened (and all unlogged tables are
>> reset). It might be a number or maybe the LSN of the startup checkpoint
>> after the most recent crash.

> I think this would be useful for a variety of purposes. Both being
> able to know the last time that it happened and being able to know the
> number of times that it happened could be useful, depending on the
> scenario.

+1.  I'd support recording the time of the last crash recovery, as
well as having a counter.  I think an LSN would not be as useful
as a timestamp.

            regards, tom lane



Re: Make unlogged table resets detectable

От
Justin Pryzby
Дата:
On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Jun 4, 2021 at 8:41 PM Jeff Davis <pgsql@j-davis.com> wrote:
> >> Stepping back, maybe unlogged tables are the wrong level to solve this
> >> problem. We could just have a "crash counter" in pg_control that would
> >> be incremented every time a crash happened (and all unlogged tables are
> >> reset). It might be a number or maybe the LSN of the startup checkpoint
> >> after the most recent crash.
> 
> > I think this would be useful for a variety of purposes. Both being
> > able to know the last time that it happened and being able to know the
> > number of times that it happened could be useful, depending on the
> > scenario.
> 
> +1.  I'd support recording the time of the last crash recovery, as
> well as having a counter.  I think an LSN would not be as useful
> as a timestamp.

+1

It's been suggested before ;)
https://www.postgresql.org/message-id/20180228221653.GB32095%40telsasoft.com

PS. I currently monitor for crashes by checking something hacky like:
| SELECT backend_start - pg_postmaster_start_time() ORDER BY 1



Re: Make unlogged table resets detectable

От
Michael Paquier
Дата:
On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> +1.  I'd support recording the time of the last crash recovery, as
> well as having a counter.  I think an LSN would not be as useful
> as a timestamp.

One could guess a timestamp based on a LSN, no?  So I'd like to think
the opposite actually: a LSN would be more useful than a timestamp.
--
Michael

Вложения

Re: Make unlogged table resets detectable

От
Julien Rouhaud
Дата:
On Tue, Jun 08, 2021 at 12:46:05PM +0900, Michael Paquier wrote:
> On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> > +1.  I'd support recording the time of the last crash recovery, as
> > well as having a counter.  I think an LSN would not be as useful
> > as a timestamp.
> 
> One could guess a timestamp based on a LSN, no?  So I'd like to think
> the opposite actually: a LSN would be more useful than a timestamp.

Wouldn't that work only if the LSN is recent enough, depending on the WAL
activity?



Re: Make unlogged table resets detectable

От
Robert Haas
Дата:
On Mon, Jun 7, 2021 at 11:46 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
> > +1.  I'd support recording the time of the last crash recovery, as
> > well as having a counter.  I think an LSN would not be as useful
> > as a timestamp.
>
> One could guess a timestamp based on a LSN, no?  So I'd like to think
> the opposite actually: a LSN would be more useful than a timestamp.

One could also guess an LSN based on a timestamp, but I think in
either case one has to be a pretty good guesser. The rate at which WAL
is generated is hardly guaranteed to be uniform, and if you're looking
at a system for the first time you may have no idea what it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Make unlogged table resets detectable

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 7, 2021 at 11:46 PM Michael Paquier <michael@paquier.xyz> wrote:
>> On Mon, Jun 07, 2021 at 02:56:57PM -0400, Tom Lane wrote:
>>> +1.  I'd support recording the time of the last crash recovery, as
>>> well as having a counter.  I think an LSN would not be as useful
>>> as a timestamp.

>> One could guess a timestamp based on a LSN, no?  So I'd like to think
>> the opposite actually: a LSN would be more useful than a timestamp.

> One could also guess an LSN based on a timestamp, but I think in
> either case one has to be a pretty good guesser.

Yeah.  If there are actually use-cases for knowing both things, then
we ought to record both.  However, it's not real clear to me why
LSN would be interesting.

BTW, I spent a bit of time thinking about whether we should
record the timestamp at start or end of crash recovery; my conclusion
is we should record the latter.  It would only make a difference to
people who wanted to inspect the value (a) while crash recovery is
in progress or (b) after a failed crash recovery.  In both scenarios,
you have other mechanisms to discover the start time of the current
crash; while if we overwrite the pg_control field at the start,
there's no longer a way to know how long ago the previous crash was.
So it seems best not to overwrite the time of the previous crash
until we're up.

(If there is a reason to log LSN, maybe the argument is different
for that?  Although I'd think that looking at the last checkpoint
REDO location is sufficient for figuring out where the current
crash recovery attempt started.)

            regards, tom lane



Re: Make unlogged table resets detectable

От
Jeff Davis
Дата:
On Tue, 2021-06-08 at 12:52 -0400, Tom Lane wrote:
> Yeah.  If there are actually use-cases for knowing both things, then
> we ought to record both.  However, it's not real clear to me why
> LSN would be interesting.

Let me expand on my use case: in a sharded environment, how do you
figure out if you need to repopulate an UNLOGGED table? For a single
node, there's not much risk, because you either have the data or you
don't. But in a sharded environment, if one node crashes, you might end
up with some shards empty and others populated, and that's
inconsistent.

If Postgres provides a way to figure out when the last crash happened,
then that would give the sharding solution the basic information it
needs to figure out if it needs to clear and repopulate the entire
unlogged table (i.e. all its shards on all nodes).

Clearly, the sharding solution would need to do some tracking of its
own, like recording when the last TRUNCATE happened, to figure out what
to do. For that tracking, I think using the LSN makes more sense than a
timestamp.

> (If there is a reason to log LSN, maybe the argument is different
> for that?  Although I'd think that looking at the last checkpoint
> REDO location is sufficient for figuring out where the current
> crash recovery attempt started.)

I came to a similar conclusion for my use case: tracking the LSN at the
end of the recovery makes more sense.

I attached a patch to track last recovery LSN, time, and total count.
But there are a few issues:

1. Do we want a way to reset the counter? If so, should it be done with
pg_resetwal or a superuser SQL function?

2. It would be helpful to also know the last time a promotion happened,
for the same reason (e.g. a failover of a single node leading to an
unlogged table with some empty shards and some populated ones). Should
also store the last promotion LSN and time as well? Does "promotion
count" make sense, and should we track that, too?

3. Should we try to track crash information across promotions, or just
start them at the initial values when promoted?

Regards,
    Jeff Davis


Вложения

Re: Make unlogged table resets detectable

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2021-06-08 at 12:52 -0400, Tom Lane wrote:
>> Yeah.  If there are actually use-cases for knowing both things, then
>> we ought to record both.  However, it's not real clear to me why
>> LSN would be interesting.

> Let me expand on my use case: in a sharded environment, how do you
> figure out if you need to repopulate an UNLOGGED table?

Since we don't put LSNs into unlogged tables, nor would the different
shards be likely to have equivalent LSNs, I'm not seeing that LSN is
remarkably better for this than a timestamp.

> 1. Do we want a way to reset the counter? If so, should it be done with
> pg_resetwal or a superuser SQL function?

I'd be kind of inclined to say no, short of pg_resetwal, and maybe
not then.

> 2. It would be helpful to also know the last time a promotion happened,

I'm not following this either.  How do you unpromote a node?

            regards, tom lane



Re: Make unlogged table resets detectable

От
Jeff Davis
Дата:
On Tue, 2021-06-08 at 16:08 -0400, Tom Lane wrote:
> Since we don't put LSNs into unlogged tables, nor would the different
> shards be likely to have equivalent LSNs, I'm not seeing that LSN is
> remarkably better for this than a timestamp.

It requires some other bookkeeping on the part of the sharding
solution. This is ugly (alternative suggestions welcome), but I think
it would work:

1. The sharding code would create on each node:
  CREATE UNLOGGED TABLE unlogged_table_status(
    shard_name regclass,
    last_truncate pg_lsn);

2. When you create an unlogged table, each node would do:
  INSERT INTO unlogged_table_status
  VALUES('my_unlogged_shard', pg_current_wal_flush_lsn())

3. When you TRUNCATE an unlogged table, each node would do:
  UPDATE unlogged_table_status
  SET last_truncate=pg_current_wal_flush_lsn()
  WHERE shard_name='my_unlogged_shard'

4. When connecting to a node and accessing a shard of an unlogged table
for the first time, test whether the shard has been lost with:
  SELECT
  last_truncate <= (pg_control_recovery()).last_recovery_lsn
    AS shard_was_lost
  FROM unlogged_table_status
  WHERE shard_name='my_unlogged_shard'

5. If the shard was lost, truncate all shards for that table on all
nodes (and update the unlogged_table_status on all nodes as in #3).

Not exactly straightforward, but better than the current situation. And
I think it can be made more robust than a timestamp.

> I'd be kind of inclined to say no, short of pg_resetwal, and maybe
> not then.

Agreed, at least until we find some use case that says otherwise.

> > 2. It would be helpful to also know the last time a promotion
> > happened,
> 
> I'm not following this either.  How do you unpromote a node?

What I meant by "node" here is actually a primary+standby pair. Let's
say each primary+standby pair holds one shard of an unlogged table.

In this case, a crash followed by restart is equivalent to a primary
failing over to a promoted standby -- in either case, the shard is
gone, but other shards of the same table may be populated on other
primaries. We need to detect that the shard is gone and then wipe out
all the other shards on the healthy primaries.

You could reasonably say that it's the job of the sharding solution to
keep track of these crashes and handle unlogged tables at the time. But
it's inconvenient to insert more tasks into a sensitive process like
failover/recovery. It's preferable to be able to detect the unlogged
table problem after the fact and handle it when the systems are all up
and stable.

Regards,
    Jeff Davis





Re: Make unlogged table resets detectable

От
Justin Pryzby
Дата:
Is this patch targetting pg15 ?
There's no discussion since June.

Latest at 2021-06-08 21:29:25 by Jeff Davis <pgsql at j-davis.com>

2022-02-02 16:37:58     Julien Rouhaud (rjuju)     Closed in commitfest 2022-01 with status: Moved to next CF
2021-12-03 06:18:05     Michael Paquier (michael-kun)     Closed in commitfest 2021-11 with status: Moved to next CF
2021-10-04 16:32:49     Jaime Casanova (jcasanov)     Closed in commitfest 2021-09 with status: Moved to next CF
2021-08-03 02:29:40     Masahiko Sawada (masahikosawada)     Closed in commitfest 2021-07 with status: Moved to next
CF



Re: Make unlogged table resets detectable

От
Michael Paquier
Дата:
On Fri, Mar 04, 2022 at 10:12:27AM -0600, Justin Pryzby wrote:
> Is this patch targetting pg15 ?
> There's no discussion since June.
>
> Latest at 2021-06-08 21:29:25 by Jeff Davis <pgsql at j-davis.com>

This is too long, so let's discard this patch for now.
--
Michael

Вложения