Обсуждение: BUG #7670: BUG #7545: Unresponsive server with error log reporting: "poll() failed: Invalid argument"

Поиск
Список
Период
Сортировка

BUG #7670: BUG #7545: Unresponsive server with error log reporting: "poll() failed: Invalid argument"

От
igorya.inscriptio@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      7670
Logged by:          Igor
Email address:      igorya.inscriptio@gmail.com
PostgreSQL version: 9.2.1
Operating system:   FreeBSD 9.0
Description:        =


The same a BUG as 7545.

Pgsql compiled from FreeBSD ports with options:

_OPTIONS_READ=3Dpostgresql-server-9.2.1
_FILE_COMPLETE_OPTIONS_LIST=3D NLS DTRACE PAM LDAP MIT_KRB5 HEIMDAL_KRB5
GSSAPI OPTIMIZED_CFLAGS XML TZDATA DEBUG ICU INTDATE SSL
OPTIONS_FILE_SET+=3DNLS
OPTIONS_FILE_UNSET+=3DDTRACE
OPTIONS_FILE_UNSET+=3DPAM
OPTIONS_FILE_UNSET+=3DLDAP
OPTIONS_FILE_UNSET+=3DMIT_KRB5
OPTIONS_FILE_UNSET+=3DHEIMDAL_KRB5
OPTIONS_FILE_UNSET+=3DGSSAPI
OPTIONS_FILE_UNSET+=3DOPTIMIZED_CFLAGS
OPTIONS_FILE_SET+=3DXML
OPTIONS_FILE_SET+=3DTZDATA
OPTIONS_FILE_UNSET+=3DDEBUG
OPTIONS_FILE_UNSET+=3DICU
OPTIONS_FILE_SET+=3DINTDATE
OPTIONS_FILE_SET+=3DSSL
On 2012-11-17 21:00:36 +0000, igorya.inscriptio@gmail.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      7670
> Logged by:          Igor
> Email address:      igorya.inscriptio@gmail.com
> PostgreSQL version: 9.2.1
> Operating system:   FreeBSD 9.0
> Description:
>
> The same a BUG as 7545.
>
> Pgsql compiled from FreeBSD ports with options:
>
> _OPTIONS_READ=postgresql-server-9.2.1
> _FILE_COMPLETE_OPTIONS_LIST= NLS DTRACE PAM LDAP MIT_KRB5 HEIMDAL_KRB5
> GSSAPI OPTIMIZED_CFLAGS XML TZDATA DEBUG ICU INTDATE SSL
> OPTIONS_FILE_SET+=NLS
> OPTIONS_FILE_UNSET+=DTRACE
> OPTIONS_FILE_UNSET+=PAM
> OPTIONS_FILE_UNSET+=LDAP
> OPTIONS_FILE_UNSET+=MIT_KRB5
> OPTIONS_FILE_UNSET+=HEIMDAL_KRB5
> OPTIONS_FILE_UNSET+=GSSAPI
> OPTIONS_FILE_UNSET+=OPTIMIZED_CFLAGS
> OPTIONS_FILE_SET+=XML
> OPTIONS_FILE_SET+=TZDATA
> OPTIONS_FILE_UNSET+=DEBUG
> OPTIONS_FILE_UNSET+=ICU
> OPTIONS_FILE_SET+=INTDATE
> OPTIONS_FILE_SET+=SSL

Could you give a bit more context about when exactly that error is
happening?
It would also be helpful if you could set log_error_verbosity =
'verbose'; in the config file and report back with the complete error
message, including line numbers.

If the error is easy to reproduce it would also be helpful to see the
failing command with ktruss or similar.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
...

Could you give a bit more context about when exactly that error is
> happening?
>

No :( This is the random event.

It would also be helpful if you could set log_error_verbosity =
> 'verbose';


Done. And will wait.

--
Regards
System and web developer.
<http://amkmobile.com>
..

> Could you give a bit more context about when exactly that error is
> happening?
> It would also be helpful if you could set log_error_verbosity =
> 'verbose'; in the config file and report back with the complete error
> message, including line numbers.
>
>
I try to restart DB and seems get this situation:

===
2012-11-18 18:30:25 EET    pid:93744  LOG:  received SIGHUP, reloading
configuration files
2012-11-18 18:30:25 EET    pid:93744  LOG:  00000: parameter
"log_error_verbosity" changed to "verbose"
2012-11-18 18:30:25 EET    pid:93744  LOCATION:  ProcessConfigFile,
guc-file.l:318
2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: received fast shutdown
request
2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2187
2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: aborting any active
transactions
2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2211
2012-11-18 18:36:45 EET    pid:93749  LOG:  00000: autovacuum launcher
shutting down
2012-11-18 18:36:45 EET    pid:93749  LOCATION:  AutoVacLauncherMain,
autovacuum.c:767
2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: shutting down
2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG, xlog.c:7560
2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: database system is shut
down
2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG, xlog.c:7582
2012-11-18 18:36:46 EET    pid:95528  FATAL:  XX000: poll() failed: Invalid
argument
2012-11-18 18:36:46 EET    pid:95528  LOCATION:  WaitLatchOrSocket,
pg_latch.c:296
===

After several minutes the rc.d script said:

pg_ctl: could not start server
Examine the log output.

Then I did:
killall -2 postgres
killall -3 postgres

And this remains procesess:
95527 pgsql           1  52    0  2159M 82400K pipewr 10   0:00  0.00%
postgres
95540 pgsql           1  20    0  2163M 83432K pipewr  0   0:00  0.00%
postgres

I can stop only via killall -9 :(

--
Regards
System and web developer.
<http://amkmobile.com>
And for my previous letter:

I feel :) that the problem is in "logging_collector" setting - I used this
feature. But now I comments this setting in config. And will watching ..


--
Regards
System and web developer.
<http://amkmobile.com>
On 2012-11-18 18:53:13 +0200, Igor wrote:
>  ..
>
> > Could you give a bit more context about when exactly that error is
> > happening?
> > It would also be helpful if you could set log_error_verbosity =
> > 'verbose'; in the config file and report back with the complete error
> > message, including line numbers.
> >
> >
> I try to restart DB and seems get this situation:
>
> ===
> 2012-11-18 18:30:25 EET    pid:93744  LOG:  received SIGHUP, reloading
> configuration files
> 2012-11-18 18:30:25 EET    pid:93744  LOG:  00000: parameter
> "log_error_verbosity" changed to "verbose"
> 2012-11-18 18:30:25 EET    pid:93744  LOCATION:  ProcessConfigFile,
> guc-file.l:318
> 2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: received fast shutdown
> request
> 2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2187
> 2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: aborting any active
> transactions
> 2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2211
> 2012-11-18 18:36:45 EET    pid:93749  LOG:  00000: autovacuum launcher
> shutting down
> 2012-11-18 18:36:45 EET    pid:93749  LOCATION:  AutoVacLauncherMain,
> autovacuum.c:767
> 2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: shutting down
> 2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG, xlog.c:7560
> 2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: database system is shut
> down
> 2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG, xlog.c:7582
> 2012-11-18 18:36:46 EET    pid:95528  FATAL:  XX000: poll() failed: Invalid
> argument
> 2012-11-18 18:36:46 EET    pid:95528  LOCATION:  WaitLatchOrSocket,
> pg_latch.c:296
> ===

Ok, this is already helpful. Do you have anything the log that shows
what process pid 95528 is? Could you grep for it?

Are you able to recompile postgres with other compilation options and/or
some debugging patches applied?
The first thing would be to recompile postgres with --enable-cassert...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
2012/11/18 Andres Freund <andres@2ndquadrant.com>

> On 2012-11-18 18:53:13 +0200, Igor wrote:
> >  ..
> >
> > > Could you give a bit more context about when exactly that error is
> > > happening?
> > > It would also be helpful if you could set log_error_verbosity =
> > > 'verbose'; in the config file and report back with the complete error
> > > message, including line numbers.
> > >
> > >
> > I try to restart DB and seems get this situation:
> >
> > ===
> > 2012-11-18 18:30:25 EET    pid:93744  LOG:  received SIGHUP, reloading
> > configuration files
> > 2012-11-18 18:30:25 EET    pid:93744  LOG:  00000: parameter
> > "log_error_verbosity" changed to "verbose"
> > 2012-11-18 18:30:25 EET    pid:93744  LOCATION:  ProcessConfigFile,
> > guc-file.l:318
> > 2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: received fast shutdown
> > request
> > 2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2187
> > 2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: aborting any active
> > transactions
> > 2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2211
> > 2012-11-18 18:36:45 EET    pid:93749  LOG:  00000: autovacuum launcher
> > shutting down
> > 2012-11-18 18:36:45 EET    pid:93749  LOCATION:  AutoVacLauncherMain,
> > autovacuum.c:767
> > 2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: shutting down
> > 2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG,
> xlog.c:7560
> > 2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: database system is
> shut
> > down
> > 2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG,
> xlog.c:7582
> > 2012-11-18 18:36:46 EET    pid:95528  FATAL:  XX000: poll() failed:
> Invalid
> > argument
> > 2012-11-18 18:36:46 EET    pid:95528  LOCATION:  WaitLatchOrSocket,
> > pg_latch.c:296
> > ===
>
> Ok, this is already helpful. Do you have anything the log that shows
> what process pid 95528 is? Could you grep for it?
>

No, this string - is the last string in the log for this event.
I think, I finished the process 95528 by kill -3 (or -2) signal. I did this
after unsuccessfully report of the pg_ctl (rc.d script).


> Are you able to recompile postgres with other compilation options and/or
> some debugging patches applied?
>

Will try. Give me this patches.


> The first thing would be to recompile postgres with --enable-cassert...
>

That all ?
Do you understand my english ? :)

--
Regards
System and web developer.
<http://amkmobile.com>
Igor <igorya.inscriptio@gmail.com> writes:
> I feel :) that the problem is in "logging_collector" setting - I used this
> feature. But now I comments this setting in config. And will watching ..

I was just about to ask what you have log_rotation_age set to ...

            regards, tom lane
On 2012-11-18 19:22:04 +0200, Igor wrote:
> 2012/11/18 Andres Freund <andres@2ndquadrant.com>
>
> > On 2012-11-18 18:53:13 +0200, Igor wrote:
> > >  ..
> > >
> > > > Could you give a bit more context about when exactly that error is
> > > > happening?
> > > > It would also be helpful if you could set log_error_verbosity =
> > > > 'verbose'; in the config file and report back with the complete error
> > > > message, including line numbers.
> > > >
> > > >
> > > I try to restart DB and seems get this situation:
> > >
> > > ===
> > > 2012-11-18 18:30:25 EET    pid:93744  LOG:  received SIGHUP, reloading
> > > configuration files
> > > 2012-11-18 18:30:25 EET    pid:93744  LOG:  00000: parameter
> > > "log_error_verbosity" changed to "verbose"
> > > 2012-11-18 18:30:25 EET    pid:93744  LOCATION:  ProcessConfigFile,
> > > guc-file.l:318
> > > 2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: received fast shutdown
> > > request
> > > 2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2187
> > > 2012-11-18 18:36:45 EET    pid:93744  LOG:  00000: aborting any active
> > > transactions
> > > 2012-11-18 18:36:45 EET    pid:93744  LOCATION:  pmdie, postmaster.c:2211
> > > 2012-11-18 18:36:45 EET    pid:93749  LOG:  00000: autovacuum launcher
> > > shutting down
> > > 2012-11-18 18:36:45 EET    pid:93749  LOCATION:  AutoVacLauncherMain,
> > > autovacuum.c:767
> > > 2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: shutting down
> > > 2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG,
> > xlog.c:7560
> > > 2012-11-18 18:36:45 EET    pid:93746  LOG:  00000: database system is
> > shut
> > > down
> > > 2012-11-18 18:36:45 EET    pid:93746  LOCATION:  ShutdownXLOG,
> > xlog.c:7582
> > > 2012-11-18 18:36:46 EET    pid:95528  FATAL:  XX000: poll() failed:
> > Invalid
> > > argument
> > > 2012-11-18 18:36:46 EET    pid:95528  LOCATION:  WaitLatchOrSocket,
> > > pg_latch.c:296
> > > ===
> >
> > Ok, this is already helpful. Do you have anything the log that shows
> > what process pid 95528 is? Could you grep for it?
> >
>
> No, this string - is the last string in the log for this event.

Actually I would be interested in *earlier* entries in the log ;)


> > Are you able to recompile postgres with other compilation options and/or
> > some debugging patches applied?

> Will try. Give me this patches.

Ok, let me prepare something.

> > The first thing would be to recompile postgres with --enable-cassert...
> >
>
> That all ?

Yes, that could already catch the source of the error. Note that you
probably will want to revert to non --enable-cassert builds after we
debugged this, it has a non-negligible overhead.

> Do you understand my english ? :)

I try ;)

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
2012/11/18 Tom Lane <tgl@sss.pgh.pa.us>

> Igor <igorya.inscriptio@gmail.com> writes:
> > I feel :) that the problem is in "logging_collector" setting - I used
> this
> > feature. But now I comments this setting in config. And will watching ..
>
> I was just about to ask what you have log_rotation_age set to ...
>
>
Was:
log_rotation_age = 30d

--
Regards
System and web developer.
<http://amkmobile.com>

Re: BUG #7670: BUG #7545: Unresponsive server with error log reporting: "poll() failed: Invalid argument"

От
Peter Geoghegan
Дата:
On 18 November 2012 17:12, Andres Freund <andres@2ndquadrant.com> wrote:
> Ok, this is already helpful. Do you have anything the log that shows
> what process pid 95528 is? Could you grep for it?

I haven't looked at this in detail, but are you sure that this isn't
the bug that was fixed by commit
e81e8f9342b037246b284bad15e42e21b1929301 ? The "invalid argument"
poll() error is too generic to be sure what is really at fault, but
this doesn't need to be a FATAL error, it seems.

Is this a 32-bit platform? It's probably just that there is a large
number of sockets, which Google seems to think can cause this error
with poll(), so this seems like a slight variant of the problem
reported by Sean Chittenden (kernel resource exhaustion causes pipe
creation to fail, with unnecessarily bad consequences for Postgres).

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 2012-11-18 12:29:29 -0500, Tom Lane wrote:
> Igor <igorya.inscriptio@gmail.com> writes:
> > I feel :) that the problem is in "logging_collector" setting - I used this
> > feature. But now I comments this setting in config. And will watching ..
>
> I was just about to ask what you have log_rotation_age set to ...

According to the manpages freebsds poll won't return EINVAL if a
contained fd is invalid... I wonder if the the timeout value is screwed
and overflowed the cast from 'long' to int and got negative.

Igor, I guess youre on a 64bit system?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Igor <igorya.inscriptio@gmail.com> writes:
> 2012/11/18 Tom Lane <tgl@sss.pgh.pa.us>
>> I was just about to ask what you have log_rotation_age set to ...

> Was:
> log_rotation_age = 30d

That's the problem then.  Internally that gets converted to
milliseconds, which'll overflow at 24-something days.  Use a smaller
value.

So the actual bug here is an over-optimistic maximum value for
log_rotation_age in guc.c.  After looking through all the callers of
WaitLatch, I think the only other similar issue is for
wal_sender_timeout, which currently is allowed to range up to INT_MAX;
but such values seem pretty useless.

I'm inclined to propose limiting both of these to the equivalent of 15
days.  Alternatively we could try to rejigger things to prevent asking
WaitLatch to wait for more than 2^31 msec, but it's not real clear to
me that it's worth the trouble.

            regards, tom lane
On 2012-11-18 17:36:40 +0000, Peter Geoghegan wrote:
> On 18 November 2012 17:12, Andres Freund <andres@2ndquadrant.com> wrote:
> > Ok, this is already helpful. Do you have anything the log that shows
> > what process pid 95528 is? Could you grep for it?
>
> I haven't looked at this in detail, but are you sure that this isn't
> the bug that was fixed by commit
> e81e8f9342b037246b284bad15e42e21b1929301 ? The "invalid argument"
> poll() error is too generic to be sure what is really at fault, but
> this doesn't need to be a FATAL error, it seems.

I looked at that commit at first as well, but I don't really see how it
could have fixed that issue.
Btw, its not a FATAL but a plain ERROR.

> Is this a 32-bit platform? It's probably just that there is a large
> number of sockets, which Google seems to think can cause this error
> with poll(), so this seems like a slight variant of the problem
> reported by Sean Chittenden (kernel resource exhaustion causes pipe
> creation to fail, with unnecessarily bad consequences for Postgres).

We never have more than 3 fds in the poll. And a failure to create the
pipe should have caused problems at other locations before this...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
> I haven't looked at this in detail, but are you sure that this isn't
> the bug that was fixed by commit
> e81e8f9342b037246b284bad15e42e21b1929301 ? The "invalid argument"
> poll() error is too generic to be sure what is really at fault, but
> this doesn't need to be a FATAL error, it seems.

He's running BSD, which is not at all wishy-washy about what it means:

[EINVAL]        The specified time limit is negative.

The reason it's FATAL is that it's happening in the syslogger process,
which has no error recovery capability.

            regards, tom lane
2012/11/18 Andres Freund <andres@2ndquadrant.com>

> On 2012-11-18 12:29:29 -0500, Tom Lane wrote:
> > Igor <igorya.inscriptio@gmail.com> writes:
> > > I feel :) that the problem is in "logging_collector" setting - I used
> this
> > > feature. But now I comments this setting in config. And will watching
> ..
> >
> > I was just about to ask what you have log_rotation_age set to ...
>
> According to the manpages freebsds poll won't return EINVAL if a
> contained fd is invalid... I wonder if the the timeout value is screwed
> and overflowed the cast from 'long' to int and got negative.
>
> Igor, I guess youre on a 64bit system?
>

Yes. FreeBSD 9.0-RELEASE amd64
And the problems of free resources on the server does not exist. And there
is minimal load.

--
Regards
System and web developer.
<http://amkmobile.com>
2012/11/18 Tom Lane <tgl@sss.pgh.pa.us>

> Igor <igorya.inscriptio@gmail.com> writes:
> > 2012/11/18 Tom Lane <tgl@sss.pgh.pa.us>
> >> I was just about to ask what you have log_rotation_age set to ...
>
> > Was:
> > log_rotation_age = 30d
>
> That's the problem then.  Internally that gets converted to
> milliseconds, which'll overflow at 24-something days.  Use a smaller
> value.
>
> So the actual bug here is an over-optimistic maximum value for
> log_rotation_age in guc.c.  After looking through all the callers of
> WaitLatch, I think the only other similar issue is for
> wal_sender_timeout, which currently is allowed to range up to INT_MAX;
> but such values seem pretty useless.
>
> I'm inclined to propose limiting both of these to the equivalent of 15
> days.


# Set this options to:
logging_collector = on
log_rotation_age = 15d
log_rotation_size = 100MB # was early

And will watch..

--
Regards
System and web developer.
<http://amkmobile.com>
On 2012-11-18 12:44:55 -0500, Tom Lane wrote:
> Igor <igorya.inscriptio@gmail.com> writes:
> > 2012/11/18 Tom Lane <tgl@sss.pgh.pa.us>
> >> I was just about to ask what you have log_rotation_age set to ...
>
> > Was:
> > log_rotation_age = 30d
>
> That's the problem then.  Internally that gets converted to
> milliseconds, which'll overflow at 24-something days.  Use a smaller
> value.

Nicely guessed.

> So the actual bug here is an over-optimistic maximum value for
> log_rotation_age in guc.c.  After looking through all the callers of
> WaitLatch, I think the only other similar issue is for
> wal_sender_timeout, which currently is allowed to range up to INT_MAX;
> but such values seem pretty useless.
>
> I'm inclined to propose limiting both of these to the equivalent of 15
> days.  Alternatively we could try to rejigger things to prevent asking
> WaitLatch to wait for more than 2^31 msec, but it's not real clear to
> me that it's worth the trouble.

In general I have no problem imposing lower limits, but it seems to be a
ugly to get errors for an invalid configuration file after a minor
version upgrade. While the wal_sender_timeout isn't really likely to be
that high I don't think the log_rotation_age one is unlikely to be set
to something in the month range, thats not an unreasonable value.

I suggest adding an Assert before the (int)timeout cast, but I guess you
already plan to do so.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-18 12:44:55 -0500, Tom Lane wrote:
>> I'm inclined to propose limiting both of these to the equivalent of 15
>> days.  Alternatively we could try to rejigger things to prevent asking
>> WaitLatch to wait for more than 2^31 msec, but it's not real clear to
>> me that it's worth the trouble.

> In general I have no problem imposing lower limits, but it seems to be a
> ugly to get errors for an invalid configuration file after a minor
> version upgrade. While the wal_sender_timeout isn't really likely to be
> that high I don't think the log_rotation_age one is unlikely to be set
> to something in the month range, thats not an unreasonable value.

Well, we have two reports of people trying such values (assuming that
#7545 actually is the same thing), and it didn't work for either of
them.  I don't think it's a problem to restrict the value to something
that will work rather than fail.

If you're worried that there's somebody out there using 20 or 21 days
as log_rotation_age, I guess we could limit to INT_MAX/1000 seconds or
something just less than that.

            regards, tom lane
2012/11/18 Tom Lane <tgl@sss.pgh.pa.us>

> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2012-11-18 12:44:55 -0500, Tom Lane wrote:
> >> I'm inclined to propose limiting both of these to the equivalent of 15
> >> days.  Alternatively we could try to rejigger things to prevent asking
> >> WaitLatch to wait for more than 2^31 msec, but it's not real clear to
> >> me that it's worth the trouble.
>
> > In general I have no problem imposing lower limits, but it seems to be a
> > ugly to get errors for an invalid configuration file after a minor
> > version upgrade. While the wal_sender_timeout isn't really likely to be
> > that high I don't think the log_rotation_age one is unlikely to be set
> > to something in the month range, thats not an unreasonable value.
>
> Well, we have two reports of people trying such values (assuming that
> #7545 actually is the same thing), and it didn't work for either of
> them.  I don't think it's a problem to restrict the value to something
> that will work rather than fail.
>
> If you're worried that there's somebody out there using 20 or 21 days
> as log_rotation_age, I guess we could limit to INT_MAX/1000 seconds or
> something just less than that.
>

Some experiments are necessary on my side ?

--
Regards
System and web developer.
<http://amkmobile.com>

Re: BUG #7670: BUG #7545: Unresponsive server with error log reporting: "poll() failed: Invalid argument"

От
Peter Geoghegan
Дата:
On 18 November 2012 18:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, we have two reports of people trying such values (assuming that
> #7545 actually is the same thing), and it didn't work for either of
> them.  I don't think it's a problem to restrict the value to something
> that will work rather than fail.

Right. sizeof(int) is very probably 4 on all platforms that we
support. I see no problem with the proposal.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Am 18.11.2012 um 19:36 schrieb Peter Geoghegan <peter@2ndquadrant.com>:

> On 18 November 2012 18:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, we have two reports of people trying such values (assuming that
>> #7545 actually is the same thing), and it didn't work for either of
>> them.  I don't think it's a problem to restrict the value to =
something
>> that will work rather than fail.
>=20
> Right. sizeof(int) is very probably 4 on all platforms that we
> support. I see no problem with the proposal.

Have you cross-checked this on a 64bit platform vs. a 32 bit platform?
e.g. on Linux i386 vs. Linux amd64?

>=20
> --=20
> Peter Geoghegan       http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training and Services
>=20
>=20
> --=20
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs


Freundliche Gr=FCsse,
micro systems

Marc Balmer

--
Marc Balmer
micro systems, Wiesendamm 2a, Postfach, 4019 Basel
fon +41 61 383 05 10, fax +41 61 383 05 12, http://www.msys.ch/
On 2012-11-18 13:18:42 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2012-11-18 12:44:55 -0500, Tom Lane wrote:
> >> I'm inclined to propose limiting both of these to the equivalent of 15
> >> days.  Alternatively we could try to rejigger things to prevent asking
> >> WaitLatch to wait for more than 2^31 msec, but it's not real clear to
> >> me that it's worth the trouble.
>
> > In general I have no problem imposing lower limits, but it seems to be a
> > ugly to get errors for an invalid configuration file after a minor
> > version upgrade. While the wal_sender_timeout isn't really likely to be
> > that high I don't think the log_rotation_age one is unlikely to be set
> > to something in the month range, thats not an unreasonable value.
>
> Well, we have two reports of people trying such values (assuming that
> #7545 actually is the same thing), and it didn't work for either of
> them.  I don't think it's a problem to restrict the value to something
> that will work rather than fail.

Thats a good point. But #7545 was on OSX so its not really that much
evidence a larger value doesn't work on non-bsdish systems.

In fact, a setting of 32 days seems not to cause any immediate problems
here on linux, even when choosing the timeout in a way it results in a
negative timeout value for poll. Not sure what its waiting for, but it
doesn't crash.

Greetings,

Andres Freund

Re: BUG #7670: BUG #7545: Unresponsive server with error log reporting: "poll() failed: Invalid argument"

От
Peter Geoghegan
Дата:
On 18 November 2012 18:42, Marc Balmer <marc@msys.ch> wrote:
> Have you cross-checked this on a 64bit platform vs. a 32 bit platform?
> e.g. on Linux i386 vs. Linux amd64?

Well, strictly speaking sizeof(int) is dictated by both the compiler
and CPU architecture in question. I believe that all current Unix-like
systems follow LLP64 (or LLP32) in practice. sizeof(int) is 4 on
Windows (though, in contrast to LLP64, sizeof(long) is 4 too on 64-bit
windows).

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-18 13:18:42 -0500, Tom Lane wrote:
>> Well, we have two reports of people trying such values (assuming that
>> #7545 actually is the same thing), and it didn't work for either of
>> them.  I don't think it's a problem to restrict the value to something
>> that will work rather than fail.

> Thats a good point. But #7545 was on OSX so its not really that much
> evidence a larger value doesn't work on non-bsdish systems.

> In fact, a setting of 32 days seems not to cause any immediate problems
> here on linux, even when choosing the timeout in a way it results in a
> negative timeout value for poll. Not sure what its waiting for, but it
> doesn't crash.

If you look closely at what's happening in syslogger.c, you'll realize
that it's a phase of the moon kind of behavior, because the
next_rotation_time is rounded off to a multiple of the specified
log_rotation_age.  Then you'll get an overflow (or not) depending on
whether now() is within 2^31 msec of that target time.  So for example
with a 30-day log_rotation_age, you'd be seeing failures during the
first five days of every "month", where a month is defined as
exactly-30-day intervals since the epoch.  And I think there's a
timezone correction in there too, which would shift the trouble
intervals around for different people.

Anyway, even if Linux for some reason doesn't reject negative values,
I think we need to limit the GUC's range so we don't try to use them.
Maybe you'd get sane behavior and maybe you wouldn't.  I notice that
the Single Unix Spec doesn't specify EINVAL for negative timeout
values, but it also doesn't define what happens for a negative value
other than -1.  So this is basically an unspecified case and it's
incumbent on us to not do it if we want portable behavior.

            regards, tom lane
On 2012-11-18 14:07:37 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2012-11-18 13:18:42 -0500, Tom Lane wrote:
> >> Well, we have two reports of people trying such values (assuming that
> >> #7545 actually is the same thing), and it didn't work for either of
> >> them.  I don't think it's a problem to restrict the value to something
> >> that will work rather than fail.
>
> > Thats a good point. But #7545 was on OSX so its not really that much
> > evidence a larger value doesn't work on non-bsdish systems.
>
> > In fact, a setting of 32 days seems not to cause any immediate problems
> > here on linux, even when choosing the timeout in a way it results in a
> > negative timeout value for poll. Not sure what its waiting for, but it
> > doesn't crash.
>
> If you look closely at what's happening in syslogger.c, you'll realize
> that it's a phase of the moon kind of behavior, because the
> next_rotation_time is rounded off to a multiple of the specified
> log_rotation_age.  Then you'll get an overflow (or not) depending on
> whether now() is within 2^31 msec of that target time.  So for example
> with a 30-day log_rotation_age, you'd be seeing failures during the
> first five days of every "month", where a month is defined as
> exactly-30-day intervals since the epoch.  And I think there's a
> timezone correction in there too, which would shift the trouble
> intervals around for different people.

I verified with strace that I calculated correctly and got a negative
value:

poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1033055888) = 1 ([{fd=10, revents=POLLIN}])

(that poll finished was due to a SIGHUP)

> Anyway, even if Linux for some reason doesn't reject negative values,
> I think we need to limit the GUC's range so we don't try to use them.
> Maybe you'd get sane behavior and maybe you wouldn't.  I notice that
> the Single Unix Spec doesn't specify EINVAL for negative timeout
> values, but it also doesn't define what happens for a negative value
> other than -1.  So this is basically an unspecified case and it's
> incumbent on us to not do it if we want portable behavior.

I aggree that we need to do something. I just think it might be enough
to clamp the timeout value passed to WaitLatchOrSocket to the maximum
valid value. The rest of SysLoggerMain seems to work correctly in that
case.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Marc Balmer <marc@msys.ch> writes:
> Am 18.11.2012 um 19:36 schrieb Peter Geoghegan <peter@2ndquadrant.com>:
>> Right. sizeof(int) is very probably 4 on all platforms that we
>> support. I see no problem with the proposal.

> Have you cross-checked this on a 64bit platform vs. a 32 bit platform?
> e.g. on Linux i386 vs. Linux amd64?

sizeof(int) is most definitely 4 on all platforms that we support.
I've lost count of how many places would fall over if that weren't the
case, but for sure the tuple packing code would have problems.

            regards, tom lane
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-18 14:07:37 -0500, Tom Lane wrote:
>> Anyway, even if Linux for some reason doesn't reject negative values,
>> I think we need to limit the GUC's range so we don't try to use them.
>> Maybe you'd get sane behavior and maybe you wouldn't.  I notice that
>> the Single Unix Spec doesn't specify EINVAL for negative timeout
>> values, but it also doesn't define what happens for a negative value
>> other than -1.  So this is basically an unspecified case and it's
>> incumbent on us to not do it if we want portable behavior.

> I aggree that we need to do something. I just think it might be enough
> to clamp the timeout value passed to WaitLatchOrSocket to the maximum
> valid value. The rest of SysLoggerMain seems to work correctly in that
> case.

This seems a rather 64-bit-centric view of the world.  If "long" is
32 bits, the value will have overflowed before it ever gets to
WaitLatch.

I'm thinking it might be reasonable to put an Assert(timeout <= INT_MAX)
into WaitLatch, since we're already asserting timeout >= 0.  But adding
a run-time clamp is not sufficient to fix the problem, because if any
overflow happens it's happening upstream.

            regards, tom lane