Discussion: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

PANIC: could not flush dirty data: Operation not permitted power8,Redhat Centos

From
reiner peterke
Date:
Hi All,

We build Postgres on Power and x86. With the latest Postgres 11 release (11.2) we get an error on
power8 ppc64le (Red Hat and CentOS). There is no error on SUSE on power8.

No error on x86_64 (RH, CentOS and SUSE).

from the log file
2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv4 address "0.0.0.0", port 5432
2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv6 address "::", port 5432
2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-04-09 12:30:10 UTC   pid:204 xid:0 ip: LOG:  database system was shut down at 2019-04-09 12:27:09 UTC
2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  database system is ready to accept connections
2019-04-09 12:31:46 UTC   pid:203 xid:0 ip: LOG:  received SIGHUP, reloading configuration files
2019-04-09 12:35:10 UTC   pid:205 xid:0 ip: PANIC:  could not flush dirty data: Operation not permitted
2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  checkpointer process (PID 205) was terminated by signal 6: Aborted
2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  terminating any other active server processes
2019-04-09 12:35:10 UTC   pid:208 xid:0 ip: WARNING:  terminating connection because of crash of another server process
2019-04-09 12:35:10 UTC   pid:208 xid:0 ip: DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-09 12:35:10 UTC   pid:208 xid:0 ip: HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  all server processes terminated; reinitializing
2019-04-09 12:35:10 UTC   pid:224 xid:0 ip: LOG:  database system was interrupted; last known up at 2019-04-09 12:30:10 UTC
2019-04-09 12:35:10 UTC   pid:224 xid:0 ip: PANIC:  could not flush dirty data: Operation not permitted
2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  startup process (PID 224) was terminated by signal 6: Aborted
2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  aborting startup due to startup process failure
2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  database system is shut down

from pg_config

pg_config output

BINDIR = /usr/local/postgres/11/bin
DOCDIR = /usr/local/postgres/11/share/doc
HTMLDIR = /usr/local/postgres/11/share/doc
INCLUDEDIR = /usr/local/postgres/11/include
PKGINCLUDEDIR = /usr/local/postgres/11/include
INCLUDEDIR-SERVER = /usr/local/postgres/11/include/server
LIBDIR = /usr/local/postgres/11/lib
PKGLIBDIR = /usr/local/postgres/11/lib
LOCALEDIR = /usr/local/postgres/11/share/locale
MANDIR = /usr/local/postgres/11/share/man
SHAREDIR = /usr/local/postgres/11/share
SYSCONFDIR = /usr/local/postgres/etc
PGXS = /usr/local/postgres/11/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-tclconfig=/usr/lib64' '--with-perl' '--with-python' '--with-tcl' '--with-openssl' '--with-pam' '--with-gssapi' '--enable-nls' '--with-libxml' '--with-libxslt' '--with-ldap' '--prefix=/usr/local/postgres/11' 'CFLAGS=-O3 -g -pipe -Wall -D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -m64 -mcpu=power8 -mtune=power8 -DLINUX_OOM_SCORE_ADJ=0' '--with-libs=/usr/lib' '--with-includes=/usr/include' '--with-uuid=e2fs' '--sysconfdir=/usr/local/postgres/etc' '--with-llvm' 'PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O3 -g -pipe -Wall -D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -m64 -mcpu=power8 -mtune=power8 -DLINUX_OOM_SCORE_ADJ=0
CFLAGS_SL = -fPIC
LDFLAGS = -L/usr/local/lib -L/usr/lib -Wl,--as-needed -Wl,-rpath,'/usr/local/postgres/11/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lxslt -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 11.2

I get the feeling this is related to the fsync() issue.
Why is it happening on Power RH and CentOS, but not on the other platforms?

Let me know if I need to provide any more information.

Reiner


Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
Andres Freund
Date:
Hi,

On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
> We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
> power8 ppc64le (Redhat and CentOS).  No error on SUSE on power8

> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv4 address "0.0.0.0", port 5432
> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv6 address "::", port 5432
> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
> 2019-04-09 12:30:10 UTC   pid:204 xid:0 ip: LOG:  database system was shut down at 2019-04-09 12:27:09 UTC
> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  database system is ready to accept connections
> 2019-04-09 12:31:46 UTC   pid:203 xid:0 ip: LOG:  received SIGHUP, reloading configuration files
> 2019-04-09 12:35:10 UTC   pid:205 xid:0 ip: PANIC:  could not flush dirty data: Operation not permitted
> 2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  checkpointer process (PID 205) was terminated by signal 6: Aborted

Any chance you can strace this? Because I don't understand how you'd get
a permission error here.


> I get the feeling this is related to the fsync() issue.
> why is it happening on Power RH and CentOS, but not on the other platforms?

Yea, the PANIC is due to various OSs, including linux, basically feeling
free to discard any dirty data after any integrity related calls fail
(we could narrow it down, but it's hard, given the variability between
versions). That is, if they signal such issues at all :(

Greetings,

Andres Freund



Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
>> We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
>> power8 ppc64le (Redhat and CentOS).  No error on SUSE on power8

> Any chance you can strace this? Because I don't understand how you'd get
> a permission error here.

What kind of filesystem are the database files on?

            regards, tom lane



Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
Thomas Munro
Date:
On Sat, Apr 13, 2019 at 7:23 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
> > We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
> > power8 ppc64le (Redhat and CentOS).  No error on SUSE on power8

Huh,  I wonder what is different.  I don't see this on EDB's CentOS
7.1 POWER8 system with an XFS filesystem.  I ran it under strace -f
and saw this:

[pid 51614] sync_file_range2(0x19, 0x2, 0x8000, 0x2000, 0x2, 0x8) = 0

> > 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv4 address "0.0.0.0", port 5432
> > 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv6 address "::", port 5432
> > 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
> > 2019-04-09 12:30:10 UTC   pid:204 xid:0 ip: LOG:  database system was shut down at 2019-04-09 12:27:09 UTC
> > 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  database system is ready to accept connections
> > 2019-04-09 12:31:46 UTC   pid:203 xid:0 ip: LOG:  received SIGHUP, reloading configuration files
> > 2019-04-09 12:35:10 UTC   pid:205 xid:0 ip: PANIC:  could not flush dirty data: Operation not permitted
> > 2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  checkpointer process (PID 205) was terminated by signal 6: Aborted
>
> Any chance you can strace this? Because I don't understand how you'd get
> a permission error here.

Me neither.  I hacked my tree so that it would use the msync() version
instead of the sync_file_range() version but that worked too.

-- 
Thomas Munro
https://enterprisedb.com



Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
zedaardv@gmail.com
Date:

sent by smoke signals at great danger to my self.

> On 12 Apr 2019, at 23:16, Thomas Munro <thomas.munro@gmail.com> wrote:
>
>> On Sat, Apr 13, 2019 at 7:23 AM Andres Freund <andres@anarazel.de> wrote:
>>> On 2019-04-12 20:04:00 +0200, reiner peterke wrote:
>>> We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
>>> power8 ppc64le (Redhat and CentOS).  No error on SUSE on power8
>
> Huh,  I wonder what is different.  I don't see this on EDB's CentOS
> 7.1 POWER8 system with an XFS filesystem.  I ran it under strace -f
> and saw this:
>
> [pid 51614] sync_file_range2(0x19, 0x2, 0x8000, 0x2000, 0x2, 0x8) = 0
>
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv4 address "0.0.0.0", port 5432
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on IPv6 address "::", port 5432
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
>>> 2019-04-09 12:30:10 UTC   pid:204 xid:0 ip: LOG:  database system was shut down at 2019-04-09 12:27:09 UTC
>>> 2019-04-09 12:30:10 UTC   pid:203 xid:0 ip: LOG:  database system is ready to accept connections
>>> 2019-04-09 12:31:46 UTC   pid:203 xid:0 ip: LOG:  received SIGHUP, reloading configuration files
>>> 2019-04-09 12:35:10 UTC   pid:205 xid:0 ip: PANIC:  could not flush dirty data: Operation not permitted
>>> 2019-04-09 12:35:10 UTC   pid:203 xid:0 ip: LOG:  checkpointer process (PID 205) was terminated by signal 6: Aborted
>>
>> Any chance you can strace this? Because I don't understand how you'd get
>> a permission error here.
>
> Me neither.  I hacked my tree so that it would use the msync() version
> instead of the sync_file_range() version but that worked too.
>
> --
> Thomas Munro
> https://enterprisedb.com

I forgot to mention that this is happening in a docker container.
I want to test it on a VM to see if it is container related. I am sick at the moment, so I'm unable to run the test
right now.

Reiner


Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
Justin Pryzby
Date:
On Fri, Apr 12, 2019 at 08:04:00PM +0200, reiner peterke wrote:
> We build Postgres on Power and x86 With the latest Postgres 11 release (11.2) we get error on
> power8 ppc64le (Redhat and CentOS).  No error on SUSE on power8
> 
> No error on x86_64 (RH, Centos and  SUSE)

So there's an error on power8 with RH but not SUSE.

What kernel versions are used for each of the successful and not successful ?

Justin



Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
Thomas Munro
Date:
On Mon, Apr 15, 2019 at 7:57 PM <zedaardv@gmail.com> wrote:
> I forgot to mention that this is happening in a docker container.

Huh, so there may be some configuration of Linux container that can
fail here with EPERM, even though that error does not appear in
the man page, and doesn't make much intuitive sense.  Would be good to
figure out how that happens.

If we could somehow confirm* that sync_file_range() with the
non-waiting flags we are using is non-destructive of error state, as
Andres speculated (that is, it cannot eat the only error report we're
ever going to get to tell us that buffered dirty data may have been
dropped), then I suppose we could just remove the data_sync_elevel()
promotion here.  As with the WSL case (before the PANIC commit and the
subsequent don't-repeat-the-warning-forever patch), a user of this
posited EPERM-generating container configuration would then get
repeated warnings in the log forever (as they presumably did before).
Repeated WARNING messages are probably OK here, I think... I mean, if,
say, someone complains that FlubOS's Linux emulation fails here with
EIEIO, I'd say they should put up with the warnings and complain over
on the flub-hackers list, or whatever, and I'd say the same for
containers that generate EPERM: either the man page or the container
technology needs work.

But... I still think we should try to avoid making decisions based on
knowledge of kernel implementation details, if it can be avoided.  I'd
probably rather treat EPERM explicitly differently (and eventually
EIEIO too, if a report comes in) than drop the current paranoid coding
completely.

*I'm not looking at it myself.  A sync_file_range() implementation is
on my list of potential FreeBSD projects for a rainy day, so I don't
want to study anything but the man page, even if it's wrong.
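
The trade-off described above can be modelled in a few lines. This is only an illustrative Python sketch, not PostgreSQL's C code: the `WARNING` and `PANIC` values are arbitrary stand-ins, `data_sync_retry` is modelled as a plain parameter, and `flush_error_level` is a hypothetical helper showing the "treat EPERM explicitly differently" option floated in this message, which is a suggestion rather than anything that was adopted.

```python
import errno

# Illustrative severity levels; the numeric values are arbitrary stand-ins.
WARNING = 1
PANIC = 2


def data_sync_elevel(elevel, data_sync_retry=False):
    """Promote a data-sync failure to PANIC unless retrying is explicitly allowed."""
    return elevel if data_sync_retry else PANIC


def flush_error_level(err, data_sync_retry=False):
    """Hypothetical policy: pick a log level for a failed flush hint.

    EPERM is assumed to mean "the environment refused the call" rather
    than "dirty data may have been dropped", so it stays a WARNING.
    Every other errno keeps the paranoid promotion to PANIC.
    """
    if err == errno.EPERM:
        return WARNING
    return data_sync_elevel(WARNING, data_sync_retry)
```

Under this sketch an I/O error still brings the server down by default, while the container-specific EPERM would only produce repeated warnings.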

-- 
Thomas Munro
https://enterprisedb.com



Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
Thomas Munro
Date:
On Wed, Apr 17, 2019 at 1:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Apr 15, 2019 at 7:57 PM <zedaardv@gmail.com> wrote:
> > I forgot to mention that this is happening in a docker container.
>
> Huh, so there may be some configuration of Linux container that can
> fail here with EPERM, even though that error does not appear in
> the man page, and doesn't make much intuitive sense.  Would be good to
> figure out how that happens.

Steve Dodd ran into the same problem in Borg[1].  It looks like what's
happening here is that on PowerPC and ARM systems, there is a second
system call sync_file_range2 that has the arguments arranged in a
better order for their calling conventions (see Notes section of man
sync_file_range), and glibc helpfully translates for you, but some
container technologies forgot to include sync_file_range2 in their
syscall forwarding table.  Perhaps we should just handle this with the
not_implemented_by_kernel mechanism I added for WSL.

[1] https://lists.freedesktop.org/archives/systemd-devel/2019-August/043276.html
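
One way to check whether a given container is affected is to issue a single sync_file_range() call through libc and look at the errno. The following is a rough probe, assuming Linux with glibc and Python's `ctypes`; `classify` and `probe_sync_file_range` are names made up for this example, and the 8192-byte scratch file is arbitrary. On PowerPC and ARM the libc wrapper is expected to end up in sync_file_range2, as described above.

```python
import ctypes
import errno
import os
import tempfile

SYNC_FILE_RANGE_WRITE = 2  # flag value from <linux/fs.h>


def classify(err):
    """Turn an errno from the probe into a short human-readable verdict."""
    if err == 0:
        return "ok"
    if err == errno.EPERM:
        return "blocked (EPERM): likely a sandbox or seccomp filter"
    if err == errno.ENOSYS:
        return "not implemented (ENOSYS)"
    return "failed: " + os.strerror(err)


def probe_sync_file_range(directory=None):
    """Call sync_file_range() once on a scratch file; return 0 or the errno."""
    libc = ctypes.CDLL(None, use_errno=True)
    fn = libc.sync_file_range
    # int sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags)
    fn.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    fn.restype = ctypes.c_int
    with tempfile.TemporaryFile(dir=directory) as f:
        f.write(b"x" * 8192)
        f.flush()  # push the bytes into the kernel page cache
        ctypes.set_errno(0)
        rc = fn(f.fileno(), 0, 8192, SYNC_FILE_RANGE_WRITE)
        return 0 if rc == 0 else ctypes.get_errno()
```

Running `print(classify(probe_sync_file_range("/path/to/pgdata")))` inside the container, with the directory replaced by one on the same filesystem as the database files, should show whether the call is being refused before PostgreSQL ever gets involved.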

-- 
Thomas Munro
https://enterprisedb.com



Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

From
Thomas Munro
Date:
On Mon, Aug 19, 2019 at 7:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 17, 2019 at 1:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Mon, Apr 15, 2019 at 7:57 PM <zedaardv@gmail.com> wrote:
> > > I forgot to mention that this is happening in a docker container.
> >
> > Huh, so there may be some configuration of Linux container that can
> > fail here with EPERM, even though that error does not appear in
> > the man page, and doesn't make much intuitive sense.  Would be good to
> > figure out how that happens.
>
> Steve Dodd ran into the same problem in Borg[1].  It looks like what's
> happening here is that on PowerPC and ARM systems, there is a second
> system call sync_file_range2 that has the arguments arranged in a
> better order for their calling conventions (see Notes section of man
> sync_file_range), and glibc helpfully translates for you, but some
> container technologies forgot to include sync_file_range2 in their
> syscall forwarding table.  Perhaps we should just handle this with the
> not_implemented_by_kernel mechanism I added for WSL.

I've just heard that it was fixed overnight in seccomp, which is
probably what Docker is using to give you EPERM for syscalls it
doesn't like the look of:

https://github.com/systemd/systemd/pull/13352/commits/90ddac6087b5f8f3736364cfdf698e713f7e8869

Not being a Docker user, I'm not sure if/when that will flow into the
right places in a timely fashion but if not it looks like you can
always configure your own profile or take one from somewhere else,
probably something like this:

https://github.com/moby/moby/commit/52d8f582c331e35f7b841171a1c22e2d9bbfd0b8

So it looks like we don't need to do anything at all on our side,
unless someone knows better.
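
For anyone who needs the custom-profile workaround before the fix reaches them, here is a small sketch of adding the missing syscall to a Docker-style seccomp profile. It assumes the profile is JSON with a top-level `syscalls` list of rules carrying `names` and `action` keys; `allow_syscall` is a hypothetical helper, and the `example` profile is a deliberately tiny stand-in rather than Docker's real default.

```python
import copy
import json


def allow_syscall(profile, name):
    """Return a copy of a Docker-style seccomp profile that also allows `name`."""
    updated = copy.deepcopy(profile)
    rules = updated.setdefault("syscalls", [])
    for rule in rules:
        if rule.get("action") == "SCMP_ACT_ALLOW" and name in rule.get("names", []):
            return updated  # already allowed, nothing to add
    rules.append({"names": [name], "action": "SCMP_ACT_ALLOW"})
    return updated


# A tiny stand-in profile: deny by default, allow only sync_file_range.
example = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["sync_file_range"], "action": "SCMP_ACT_ALLOW"}],
}

patched = allow_syscall(example, "sync_file_range2")
print(json.dumps(patched, indent=2))
```

The resulting JSON can be saved to a file and passed to the container with `docker run --security-opt seccomp=profile.json ...`; in practice one would start from the full default profile rather than the stand-in above.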

-- 
Thomas Munro
https://enterprisedb.com