Discussion: Using of --data-checksums


Using of --data-checksums

From
BGoebel
Date:
initdb --data-checksums "... help to detect corruption by the I/O system"
There is an (negligible?) impact on performance, ok. 
 
Is there another reason NOT to use this feature ?
Has anyone had good or bad experience with the use of  --data-checksums?

Thanks in advance!

Bernhard



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html



Re: Using of --data-checksums

From
Michael Paquier
Date:
On Tue, Apr 07, 2020 at 08:10:13AM -0700, BGoebel wrote:
> initdb --data-checksums "... help to detect corruption by the I/O system"
> There is an (negligible?) impact on performance, ok.
>
> Is there another reason NOT to use this feature ?
> Has anyone had good or bad experience with the use of  --data-checksums?

FWIW, I have a good experience with it.  Note that some performance
impact of up to ~1% may be noticeable if you have a large number of
buffer evictions from PostgreSQL shared buffer pool, but IMO the
insurance of knowing that Postgres is not the cause of an on-disk
corruption is largely worth it (in applications where I got that
enabled we did not notice any performance impact even in very heavy
production-like workloads, and this even if we had a rather low shared
buffer setting with a much larger set of hot pages, causing the OS
cache to be filled with most of the hot data).
--
Michael
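
For readers new to the feature, here is a minimal sketch of the idea behind a per-page checksum. It is not PostgreSQL's implementation: PostgreSQL stores a 16-bit checksum in the page header, computed with its own FNV-1a-derived algorithm, whereas this sketch uses a full 32-bit zlib.crc32 as a stand-in. The function names and the sample page are hypothetical. The block number is mixed in, as PostgreSQL also does, so that a valid page written to the wrong location is detected too.

```python
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size


def page_checksum(page: bytes, block_number: int) -> int:
    """Stand-in checksum (CRC-32); NOT PostgreSQL's real algorithm."""
    # Prefix the block number so the same bytes at another block
    # position produce a different checksum.
    return zlib.crc32(block_number.to_bytes(4, "little") + page)


def verify(page: bytes, block_number: int, stored: int) -> bool:
    """Return True if the page still matches the checksum stored at write time."""
    return page_checksum(page, block_number) == stored


# A hypothetical page, checksummed when it was "written" as block 7.
page = bytes(range(256)) * (PAGE_SIZE // 256)
stored = page_checksum(page, 7)

# Simulate the I/O system flipping a single bit on disk.
corrupted = bytearray(page)
corrupted[100] ^= 0x01

print(verify(page, 7, stored))              # intact page, right block
print(verify(bytes(corrupted), 7, stored))  # one flipped bit
print(verify(page, 8, stored))              # intact page, wrong block
```

The check runs when a page is read into shared buffers and the checksum is computed when a page is written out, which is why the overhead mentioned above depends on the number of buffer evictions.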


Re: Using of --data-checksums

From
Stephen Frost
Date:
Greetings,

* BGoebel (b.goebel@prisma-computer.de) wrote:
> initdb --data-checksums "... help to detect corruption by the I/O system"
> There is an (negligible?) impact on performance, ok.
>
> Is there another reason NOT to use this feature ?

Not in my view.

> Has anyone had good or bad experience with the use of  --data-checksums?

Have had good experience with it.  We should really make it the default
already.

Thanks,

Stephen


Re: Using of --data-checksums

From
Bruce Momjian
Date:
On Wed, Apr  8, 2020 at 11:54:34AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * BGoebel (b.goebel@prisma-computer.de) wrote:
> > initdb --data-checksums "... help to detect corruption by the I/O system"
> > There is an (negligible?) impact on performance, ok. 
> >  
> > Is there another reason NOT to use this feature ?
> 
> Not in my view.
> 
> > Has anyone had good or bad experience with the use of  --data-checksums?
> 
> Have had good experience with it.  We should really make it the default
> already.

Yeah, but I think we wanted more ability to change an existing cluster
before doing that since it would affect pg_upgraded servers.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Using of --data-checksums

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Wed, Apr  8, 2020 at 11:54:34AM -0400, Stephen Frost wrote:
> > * BGoebel (b.goebel@prisma-computer.de) wrote:
> > > initdb --data-checksums "... help to detect corruption by the I/O system"
> > > There is an (negligible?) impact on performance, ok.
> > >
> > > Is there another reason NOT to use this feature ?
> >
> > Not in my view.
> >
> > > Has anyone had good or bad experience with the use of  --data-checksums?
> >
> > Have had good experience with it.  We should really make it the default
> > already.
>
> Yeah, but I think we wanted more ability to change an existing cluster
> before doing that since it would affect pg_upgraded servers.

There's definitely a lot of reasons to want to have the ability to
change an existing cluster.  Considering the complications around
running pg_upgrade already, I don't really think that changing the
default of initdb would be that big a hurdle for folks to deal with-
they'd try the pg_upgrade, get a very quick error that the new cluster
has checksums enabled and the old one didn't, and they'd re-initdb the
new cluster and then re-run pg_upgrade to figure out what the next issue
is..

Thanks,

Stephen


Re: Using of --data-checksums

From
Michael Paquier
Date:
On Fri, Apr 10, 2020 at 04:37:46PM -0400, Stephen Frost wrote:
> There's definitely a lot of reasons to want to have the ability to
> change an existing cluster.  Considering the complications around
> running pg_upgrade already, I don't really think that changing the
> default of initdb would be that big a hurdle for folks to deal with-
> they'd try the pg_upgrade, get a very quick error that the new cluster
> has checksums enabled and the old one didn't, and they'd re-initdb the
> new cluster and then re-run pg_upgrade to figure out what the next issue
> is..

We discussed that a couple of months ago, and we decided to keep that
out of the upgrade story, no?  Anyway, if you want to enable or
disable data checksums on an existing cluster, you can always use
pg_checksums --enable (or --disable).  The tool has been in core since
12, and there is also an out-of-core version for older versions of
Postgres: https://github.com/credativ/pg_checksums.  On apt-based
distributions like Debian, it is available in the package
postgresql-12-pg-checksums.
--
Michael
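
A small sketch of how the pg_checksums invocations mentioned above could be assembled from a script. The --check, --enable, --disable, --pgdata and --progress options are those of the in-core tool from PostgreSQL 12; the helper name and the data directory path are hypothetical. The helper only builds the command and does not run it, because pg_checksums requires the cluster to be shut down cleanly first.

```python
import shlex

# Map a readable action name to the pg_checksums option.
ACTIONS = {"check": "--check", "enable": "--enable", "disable": "--disable"}


def pg_checksums_argv(data_dir: str, action: str = "check",
                      progress: bool = False) -> list[str]:
    """Build the argument list for pg_checksums (hypothetical helper)."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    argv = ["pg_checksums", ACTIONS[action], "--pgdata", data_dir]
    if progress:
        # Enabling rewrites every block, so progress output is handy.
        argv.append("--progress")
    return argv


# Hypothetical Debian-style data directory; adjust to your installation.
cmd = pg_checksums_argv("/var/lib/postgresql/12/main", "enable", progress=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the server has been stopped.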


Re: Using of --data-checksums

From
Magnus Hagander
Date:


On Sun, Apr 12, 2020 at 8:05 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Apr 10, 2020 at 04:37:46PM -0400, Stephen Frost wrote:
> > There's definitely a lot of reasons to want to have the ability to
> > change an existing cluster.  Considering the complications around
> > running pg_upgrade already, I don't really think that changing the
> > default of initdb would be that big a hurdle for folks to deal with-
> > they'd try the pg_upgrade, get a very quick error that the new cluster
> > has checksums enabled and the old one didn't, and they'd re-initdb the
> > new cluster and then re-run pg_upgrade to figure out what the next issue
> > is..
>
> We discussed that a couple of months ago, and we decided to keep that
> out of the upgrade story, no?  Anyway, if you want to enable or
> disable data checksums on an existing cluster, you always have the
> possibility to use pg_checksums --enable.  This exists in core since
> 12, and there is also a version on out of core for older versions of
> Postgres: https://github.com/credativ/pg_checksums.  On apt-based
> distributions like Debian, this stuff is under the package
> postgresql-12-pg-checksums.

I think the fact that this tool exists, and in particular pg_checksums --disable, is what makes the argument for turning on checksums by default possible, because it is now very easy and fast to turn them off even if you have accumulated sizable data in your cluster. (Turning them on in that case is easy, but not fast.)

And FWIW, I do think we should change the default. And maybe spend some extra effort on the message coming out of pg_upgrade in this case to make it clear to people what their options are and exactly what to do.
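
To illustrate the mismatch that pg_upgrade complains about, here is a minimal sketch that reads the checksum state from pg_controldata output and compares two clusters before an upgrade is attempted. The "Data page checksum version" line is what pg_controldata prints (0 means checksums are off); the function names and the sample outputs are hypothetical, trimmed to two lines, and the exact column spacing may differ on a real system.

```python
import re


def checksum_version(controldata_output: str) -> int:
    """Extract the data page checksum version from pg_controldata output."""
    match = re.search(r"^Data page checksum version:\s*(\d+)\s*$",
                      controldata_output, re.MULTILINE)
    if match is None:
        raise ValueError("checksum version line not found")
    return int(match.group(1))


def checksums_match(old_output: str, new_output: str) -> bool:
    """True if both clusters agree on whether checksums are enabled."""
    old_enabled = checksum_version(old_output) != 0
    new_enabled = checksum_version(new_output) != 0
    return old_enabled == new_enabled


# Hypothetical, trimmed samples of `pg_controldata -D <datadir>` output.
OLD_SAMPLE = (
    "Database block size:                  8192\n"
    "Data page checksum version:           0\n"
)
NEW_SAMPLE = (
    "Database block size:                  8192\n"
    "Data page checksum version:           1\n"
)

if not checksums_match(OLD_SAMPLE, NEW_SAMPLE):
    print("checksum settings differ: re-initdb the new cluster without "
          "checksums, or enable them on the old one with pg_checksums")
```

A check like this could run ahead of pg_upgrade so that the options are spelled out before the upgrade itself fails.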


Re: Using of --data-checksums

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> And FWIW, I do think we should change the default. And maybe spend some
> extra effort on the message coming out of pg_upgrade in this case to make
> it clear to people what their options are and exactly what to do.

Is there any hard evidence of checksums catching problems at all?
Let alone in sufficient number to make them be on-by-default?

            regards, tom lane



Re: Using of --data-checksums

From
Michael Paquier
Date:
On Sun, Apr 12, 2020 at 10:23:24AM -0400, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> And FWIW, I do think we should change the default. And maybe spend some
>> extra effort on the message coming out of pg_upgrade in this case to make
>> it clear to people what their options are and exactly what to do.
>
> Is there any hard evidence of checksums catching problems at all?
> Let alone in sufficient number to make them be on-by-default?

I don't know if that's a sufficient number, but I have dealt with
corruption cases in virtual environments where checksums were really
essential to prove that the problem did not originate in Postgres,
because those bugs created wild and incorrect block overwrites.
With the software stack getting more complicated, making
them the default would make sense IMO.  Now the case of upgrades is
more tricky than it is, no?  There is a copy of the file so we may be
able to do a block-to-block copy and update of the checksum, but you
cannot do that with the --link mode.
--
Michael


Re: Using of --data-checksums

From
Magnus Hagander
Date:
On Sun, Apr 12, 2020 at 4:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > And FWIW, I do think we should change the default. And maybe spend some
> > extra effort on the message coming out of pg_upgrade in this case to make
> > it clear to people what their options are and exactly what to do.
>
> Is there any hard evidence of checksums catching problems at all?
> Let alone in sufficient number to make them be on-by-default?

I would say yes. I've certainly had a fair number of cases where they've detected storage corruption, especially with larger SAN-type installations. And coupled with validating the checksums on backup (either with pg_basebackup or pgbackrest), it enables you to find the errors *early*, while you can still restore a previous backup and replay WAL to get to a point where you don't have to lose any data.

I believe both Stephen and David have some good stories they've heard from people catching such issues with backrest as well.

On top of this, as Michael also points out, it lets you know that the problem occurred outside of PostgreSQL, which is very important information when tracking down issues.


Re: Using of --data-checksums

From
Jeremy Schneider
Date:
On 4/12/20 07:23, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> And FWIW, I do think we should change the default. And maybe spend some
>> extra effort on the message coming out of pg_upgrade in this case to make
>> it clear to people what their options are and exactly what to do.
> 
> Is there any hard evidence of checksums catching problems at all?
> Let alone in sufficient number to make them be on-by-default?

Data checksums are a hard requirement across the entire RDS PostgreSQL
fleet - we do not allow it to be disabled in RDS. I've definitely seen a
lot of hard evidence (for example, customer cases I've personally been
involved in) that it catches problems. I could not exaggerate how useful
and important I think this feature is: being able to very quickly and
easily know that a problem originated outside of the PostgreSQL code.
This was in part what led to that long blog article I wrote about
checksums, and it's why enabling checksums was happiness hint #1 until I
broke them into categories.

FWIW, I also strongly agree that checksums should be enabled by default
in the git.postgresql.org code.

-Jeremy


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services



Re: Using of --data-checksums

From
Michael Paquier
Date:
On Thu, Apr 16, 2020 at 03:47:34PM -0700, Jeremy Schneider wrote:
> Data checksums are a hard requirement across the entire RDS PostgreSQL
> fleet - we do not allow it to be disabled in RDS. I've definitely seen a
> lot of hard evidence (for example, customer cases I've personally been
> involved in) that it catches problems.

Oh, that's good to know.  Thanks for the information.  I pushed hard
as well to make this a requirement on what I work on.

> I could not exaggerate how useful
> and important I think this feature is: being able to very quickly and
> easily know that a problem originated outside of the PostgreSQL code.

The worst part with checksums disabled is having to tell a customer or
support staff that you don't actually know where the problem comes
from or what its actual origin is, and having to explain why you think
the error you are seeing in the Postgres logs is most likely linked to
lower-level corruption, as there can be many patterns: broken btree
pages, toast errors, missing attributes in catalogs, failures with FK
references, primary key duplicates, etc.  And people like to complain
a lot about the database being broken because that's a very sensitive
piece and usually more things depend on it.  With checksums enabled,
you still cannot say exactly where the problem comes from, but you can
more easily redirect the complaints to the correct people to help find
out what the actual problem is.  Even better, if you don't see a
checksum failure, you also know that the problem probably comes
directly from Postgres and some backend logic (note that it could as
well be a misdesigned HA workflow or a custom backup script, who
knows, but at least you know that something you control directly went
wrong).  And the error message provided is clear.

> This was in part what led to that long blog article I wrote about
> checksums, and it's why enabling checksums was happiness hint #1 until I
> broke them into categories.

Reference? ;p
--
Michael
