Discussion: syslog enabled causes random hangs?

syslog enabled causes random hangs?

From:
"Arthur Ward"
Date:
I'm encountering strange hangs in postgresql backends at random moments.
They seem to be associated with attempts to issue log entries via syslog.
I have run backtraces on the hung backends a few times, and they routinely
trace into system libraries where it looks like a stuck syslog call. So
far, I haven't had this problem with any other apps, so I'm thinking it's
a condition being aggravated by Postgres.

The machines I'm seeing this on are RedHat 8 and Gentoo Linux machines
(x86). We're running compiled-from-source 7.3.3 on both machines (by hand
on the RH8 machine, and built by Gentoo's package manager, because it's a
build-from-source distribution). On my Gentoo test boxes, it happens
occasionally using syslog-ng 1.6.0_rc3 and glibc 2.3.2-r1. On the RedHat 8
machine, it's sysklogd-1.4.1-10 and "glibc-devel-2.3.2-4.80.6". I switched
to not using syslog on the RH8 machine because it was hanging backends
quite regularly, and that's rather bad in production. :)

Has anyone else had problems using the syslog option?

I need to rebuild PostgreSQL on my test machine so I'll have unstripped
binaries, but once that's done, I'll bring on the stack traces when I get
another hang.



Re: syslog enabled causes random hangs?

From:
Tom Lane
Date:
"Arthur Ward" <award@dominionsciences.com> writes:
> I'm encountering strange hangs in postgresql backends at random moments.
> They seem to be associated with attempts to issue log entries via syslog.
> I have run backtraces on the hung backends a few times, and they routinely
> trace into system libraries where it looks like a stuck syslog call. So
> far, I haven't had this problem with any other apps, so I'm thinking it's
> a condition being aggravated by Postgres.

How verbose are your Postgres logging settings?

On most platforms I've dealt with, syslog drops messages on the floor if
it starts to get saturated.  It may be that the Linux implementation has
worse behavior than that under heavy load :-(.  In any case I'd suggest
filing a bug against syslog.  There's very little we can do about it if
the syslog library call hangs up.

Personally I find it more reliable to pipe the postmaster's stderr to
some sort of log-rotation program than to depend on syslog.  It seems
that the Apache folks have found the same, since they include a suitable
log-rotation filter in their distribution ...

            regards, tom lane

Re: syslog enabled causes random hangs?

From:
"Mendola Gaetano"
Date:
""Arthur Ward"" <award@dominionsciences.com> writes:
> I'm encountering strange hangs in postgresql backends at random moments.
> They seem to be associated with attempts to issue log entries via syslog.
> I have run backtraces on the hung backends a few times, and they routinely
> trace into system libraries where it looks like a stuck syslog call. So
> far, I haven't had this problem with any other apps, so I'm thinking it's
> a condition being aggravated by Postgres.
>
> The machines I'm seeing this on are RedHat 8 and Gentoo Linux machines
> (x86). We're running compiled-from-source 7.3.3 on both machines (by hand
> on the RH8 machine, and built by Gentoo's package manager, because it's a
> build-from-source distribution). On my Gentoo test boxes, it happens
> occasionally using syslog-ng 1.6.0_rc3 and glibc 2.3.2-r1. On the RedHat 8
> machine, it's sysklogd-1.4.1-10 and "glibc-devel-2.3.2-4.80.6". I switched
> to not using syslog on the RH8 machine because it was hanging backends
> quite regularly, and that's rather bad in production. :)
>
> Has anyone else had problems using the syslog option?
>
> I need to rebuild PostgreSQL on my test machine so I'll have unstripped
> binaries, but once that's done, I'll bring on the stack traces when I get
> another hang.


Try putting a "-" before the name of your log file; this tells syslogd not
to flush each message to disk, so your line should look something like:

LOCAL0.*           -/var/log/postgresql.log


Regards
Gaetano Mendola



Re: syslog enabled causes random hangs?

From:
"Arthur Ward"
Date:
> "Arthur Ward" <award@dominionsciences.com> writes:
>> I'm encountering strange hangs in postgresql backends at random
>> moments. They seem to be associated with attempts to issue log entries
>> via syslog. I have run backtraces on the hung backends a few times,
>> and they routinely trace into system libraries where it looks like a
>> stuck syslog call. So far, I haven't had this problem with any other
>> apps, so I'm thinking it's a condition being aggravated by Postgres.
>
> How verbose are your Postgres logging settings?

Defaults (single-user test machine -- not much activity). In what I was
doing today, it seemed to happen most often with the occurrence of a
checkpoint, which would normally log a message about recycling the
transaction log files.

> On most platforms I've dealt with, syslog drops messages on the floor if
> it starts to get saturated.  It may be that the Linux implementation has
> worse behavior than that under heavy load :-(.  In any case I'd suggest
> filing a bug against syslog.  There's very little we can do about it if
> the syslog library call hangs up.

After rebuilding to get a non-stripped copy of Postgresql, I did a little
searching on my backtraces. Some of the first hits are pointing at a
deadlock condition in glibc, although I'm not sure how or why this is
affecting PostgreSQL running independently of my app and its signal
handler. Maybe some more digging will enlighten me...

> Personally I find it more reliable to pipe the postmaster's stderr to
> some sort of log-rotation program than to depend on syslog.  It seems
> that the Apache folks have found the same, since they include a suitable
> log-rotation filter in their distribution ...

Considering that deadlocks in the system libraries sound a bit scary to me,
I'm definitely convinced to change my development machine now... As I
originally wrote, I'd already changed the production machine.


FWIW, since I already went to the trouble (and for the sake of people
searching the archives in the future), here's what I was seeing in the
backtraces after rebuilding this afternoon:

This process:
19763 pts/2    S      0:00 postgres: checkpoint subprocess
Was stuck here:
(gdb) bt
#0  0x402cf077 in semop () from /lib/libc.so.6
(gdb)

And the other:
19720 pts/2    S      0:04 postgres: award Trucking [local] UPDATE
(gdb) bt
#0  0x4021cec6 in sigsuspend () from /lib/libc.so.6
#1  0x424b6218 in __pthread_wait_for_restart_signal ()
   from /lib/libpthread.so.0
#2  0x424b79a0 in __pthread_alt_lock () from /lib/libpthread.so.0
#3  0x424b4c17 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x402ca21c in vsyslog () from /lib/libc.so.6
#5  0x402c9d8f in syslog () from /lib/libc.so.6
#6  0x08150a57 in write_syslog ()
(gdb)




Re: syslog enabled causes random hangs?

From:
Tom Lane
Date:
"Arthur Ward" <award@dominionsciences.com> writes:
> FWIW, since I already went to the trouble (and for the sake of people
> searching the archives in the future), here's what I was seeing in the
> backtraces after rebuilding this afternoon:

> This process:
> 19763 pts/2    S      0:00 postgres: checkpoint subprocess
> Was stuck here:
> (gdb) bt
> #0  0x402cf077 in semop () from /lib/libc.so.6
> (gdb)

Hmm.  It's odd that gdb isn't showing a backtrace any deeper than that
--- are you sure you rebuilt with debug symbols enabled?  But anyway,
I'd say this process is just waiting for someone else to release a lock.
Likely the lock holder is this other process:

> And the other:
> 19720 pts/2    S      0:04 postgres: award Trucking [local] UPDATE
> (gdb) bt
> #0  0x4021cec6 in sigsuspend () from /lib/libc.so.6
> #1  0x424b6218 in __pthread_wait_for_restart_signal ()
>    from /lib/libpthread.so.0
> #2  0x424b79a0 in __pthread_alt_lock () from /lib/libpthread.so.0
> #3  0x424b4c17 in pthread_mutex_lock () from /lib/libpthread.so.0
> #4  0x402ca21c in vsyslog () from /lib/libc.so.6
> #5  0x402c9d8f in syslog () from /lib/libc.so.6
> #6  0x08150a57 in write_syslog ()
> (gdb)

This seems worthy of an inquiry to the glibc maintainers, or at least
some research to find out what syslog() could block on.

            regards, tom lane

Re: syslog enabled causes random hangs?

From:
award@dominionsciences.com
Date:
>> And the other:
>> 19720 pts/2    S      0:04 postgres: award Trucking [local] UPDATE
>> (gdb) bt
>> #0  0x4021cec6 in sigsuspend () from /lib/libc.so.6
>> #1  0x424b6218 in __pthread_wait_for_restart_signal ()
>>    from /lib/libpthread.so.0
>> #2  0x424b79a0 in __pthread_alt_lock () from /lib/libpthread.so.0
>> #3  0x424b4c17 in pthread_mutex_lock () from /lib/libpthread.so.0
>> #4  0x402ca21c in vsyslog () from /lib/libc.so.6
>> #5  0x402c9d8f in syslog () from /lib/libc.so.6
>> #6  0x08150a57 in write_syslog ()
>> (gdb)
>
> This seems worthy of an inquiry to the glibc maintainers, or at least
> some research to find out what syslog() could block on.

I've been digging, and the best information I've been able to come up with
is that there may be an unresolved glibc bug involving syslog and threads
dating back to at least 2001, or that glibc's syslog() is not re-entrant
and deadlocks when called from inside a signal handler while it is already
executing. Neither of those make a whole lot of sense to me, though:

1. The Postgresql backend shouldn't be getting any signals (at least none
external to Postgresql).

2. Postgresql doesn't use threads, right? That occurred to me this
morning, and I'm a little disturbed by the appearance of libpthread in the
backtrace.

The disgusting thing is that all the mail threads I've followed either
dead-end, or mention creating a signal-safe and/or thread-safe wrapper
around syslog inside the calling application. Yuck! The only
information I've found which was particularly enlightening was a mention
that POSIX doesn't specify that syslog() should be reentrant, and
(according to the poster) glibc's is not and doesn't claim to be (which I
confirmed in the on-line manual).
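
For anyone reading this in the archives later, the hazard being described
boils down to something like the contrived C sketch below. This is my own
illustration, not Postgres code, and it is timing-dependent: it only wedges
if the signal happens to land while the main line of execution is already
inside syslog() holding its internal lock -- which is exactly the
pthread_mutex_lock frame in my backtraces. The program name and facility
are made up for the demo.

    #include <signal.h>
    #include <sys/time.h>
    #include <syslog.h>

    /* Deliberately async-signal-unsafe: if this runs while the main line
     * of execution is already inside syslog() holding its internal lock,
     * this second call blocks on the same non-recursive lock forever. */
    static void
    alarm_handler(int sig)
    {
        syslog(LOG_WARNING, "logging from inside a signal handler");
    }

    int
    main(void)
    {
        /* deliver SIGALRM every 10 ms */
        struct itimerval tick = {{0, 10000}, {0, 10000}};

        openlog("syslog-deadlock-demo", LOG_PID, LOG_LOCAL0);
        signal(SIGALRM, alarm_handler);
        setitimer(ITIMER_REAL, &tick, NULL);

        for (;;)                    /* run until it wedges (if it ever does) */
            syslog(LOG_INFO, "logging from the main line of execution");

        return 0;
    }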

Blah. I'm ready to chalk this one up to glibc inadequacies and put it
behind me now, especially when the simple adjustment of
pipe-to-log-rotator does just fine.

Re: syslog enabled causes random hangs?

From:
Tom Lane
Date:
award@dominionsciences.com writes:
> Neither of those make a whole lot of sense to me, though:
> 1. The Postgresql backend shouldn't be getting any signals (at least none
> external to Postgresql).

Hm.  Postgres does sometimes issue elog messages from inside a signal
handler.  I doubt that it would do so while interrupting a non-interrupt
elog issuance, but wouldn't swear to it.

> 2. Postgresql doesn't use threads, right? That occurred to me this
> morning, and I'm a little disturbed by the appearance of libpthread in the
> backtrace.

Postgres doesn't, but that doesn't mean that your glibc wouldn't
internally assume that it might be in a threaded app.  See the recent
flamewar about threaded vs unthreaded libc in pgsql-hackers.

It's really too bad that your gdb backtrace didn't show anything past
the write_syslog level (which is an elog subroutine).  If we could see
whether the elog had been issued from a signal handler, and if so what
it had interrupted, we'd have an idea whether this is a known syslog
deficiency or not.  Do you want to rebuild with debug symbols and try
the backtrace again?

> The disgusting thing is that all the mail threads I've followed either
> dead-end, or mention creating a signal-safe and/or thread-safe wrapper
> around syslog inside the calling application. Yuck!

Agreed, but we might be looking at needing to do that in Postgres, if
this turns out to be something that the glibc boys know about and won't
fix.
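
If it does come to that, a wrapper along these lines might be the shape of
it (a rough sketch only, nothing that exists in the tree today; the name
safe_syslog is made up). The idea is simply to hold signals off for the
duration of the library call so a handler can never re-enter the
non-reentrant code:

    #include <signal.h>
    #include <stdarg.h>
    #include <syslog.h>

    /*
     * Hypothetical signal-safe wrapper: block all signals while the
     * non-reentrant vsyslog() call is in progress, then restore the
     * caller's signal mask.
     */
    void
    safe_syslog(int priority, const char *fmt, ...)
    {
        sigset_t    block_all;
        sigset_t    saved;
        va_list     args;

        sigfillset(&block_all);
        sigprocmask(SIG_BLOCK, &block_all, &saved);

        va_start(args, fmt);
        vsyslog(priority, fmt, args);
        va_end(args);

        sigprocmask(SIG_SETMASK, &saved, NULL);
    }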

            regards, tom lane

Re: syslog enabled causes random hangs?

From:
award@dominionsciences.com
Date:
> It's really too bad that your gdb backtrace didn't show anything past
> the write_syslog level (which is an elog subroutine).  If we could see
> whether the elog had been issued from a signal handler, and if so what
> it had interrupted, we'd have an idea whether this is a known syslog
> deficiency or not.  Do you want to rebuild with debug symbols and try
> the backtrace again?

Since you think it might be necessary to add a workaround in Postgresql,
I'll try it again when I install 7.3.4 on my test machine. That will
probably be in a few days. I need to get back to real work again. :-)

I'm sure I rebuilt 7.3.3 with the symbols left in place. Off the top of
your head, do you know an easy way to tell if a binary still has symbols?

Re: syslog enabled causes random hangs?

From:
"Arthur Ward"
Date:
I'm back with more on the funky glibc-syslog-Postgres deadlocking behavior:

> It's really too bad that your gdb backtrace didn't show anything past
> the write_syslog level (which is an elog subroutine).  If we could see
> whether the elog had been issued from a signal handler, and if so what
> it had interrupted, we'd have an idea whether this is a known syslog
> deficiency or not.  Do you want to rebuild with debug symbols and try
> the backtrace again?

A minor ah-ha here: Whoever wrote the original Gentoo build for Postgres
left out --enable-debug, even though Gentoo has a system-wide flag for
"no, I don't want symbols" (which I did not set).

I was testing a different piece of my app at home, and triggered the
deadlock by accident. Being frustrated, I took the time to rebuild PG with
symbols, and also rebuilt Python (for other reasons), which accounts for
the RExec exception in the traceback here. The exception normally comes
back to the client with no problems. I had forgotten to comment out the
poison line in Python 2.2.3's RExec module so my plpython code would
continue working.

This is PostgreSQL 7.3.4 running against glibc 2.3.2. It just occurred to
me that I didn't trace the postmaster or the other always-on processes,
just the two dead backends. In a couple of attempts to recreate the
problem again, I had no failures. :-( The original problem was on a
different machine running a different part of my project, and seemed more
reproducible, or I could have just had a bad day, so I can go back to that
at some point if tracing the other processes is a good idea.

postgres 27259  3.2  6.4 47488 20600 pts/0   S    03:30   0:05 postgres:
postgres Trucking [local] VACUUM
(gdb) bt
#0  0x4dbf81a7 in semop () from /lib/libc.so.6
#1  0x080f2dd7 in PGSemaphoreLock (sema=0x1, interruptOK=1 '\001') at
pg_sema.c:434
#2  0x08106dd3 in ProcWaitForSignal () at proc.c:895
#3  0x08101878 in LockBufferForCleanup (buffer=1021) at bufmgr.c:1946
#4  0x080c380b in lazy_vacuum_heap (onerel=0x420a20f8,
vacrelstats=0x8288f88) at vacuumlazy.c:474
#5  0x080c3476 in lazy_scan_heap (onerel=0x420a20f8,
vacrelstats=0x8288f88, Irel=0x8297818, nindexes=3) at vacuumlazy.c:426
#6  0x080c31e2 in lazy_vacuum_rel (onerel=0x21, vacstmt=0x1) at
vacuumlazy.c:158
#7  0x080c02ee in vacuum_rel (relid=1021, vacstmt=0x8287138,
expected_relkind=114 'r') at vacuum.c:829
#8  0x080bfd32 in vacuum (vacstmt=0x8287138) at vacuum.c:290
#9  0x0810a3a1 in pg_exec_query_string (query_string=0x8287138,
dest=Remote, parse_context=0x827bb30) at postgres.c:789
#10 0x0810afdf in PostgresMain (argc=4, argv=0xbfff7cc8,
username=0x822e1d9 "postgres") at postgres.c:2013
#11 0x080f5175 in DoBackend (port=0x822e0a8) at postmaster.c:2310
#12 0x080f4d2f in BackendStartup (port=0x822e0a8) at postmaster.c:1932
#13 0x080f3f95 in ServerLoop () at postmaster.c:1009
#14 0x080f3989 in PostmasterMain (argc=1, argv=0x82162e0) at postmaster.c:788
#15 0x080d684c in main (argc=1, argv=0xbfff85c4) at main.c:210
#16 0x4db327a7 in __libc_start_main () from /lib/libc.so.6

postgres 27235  5.7  3.0 41900 9784 pts/0    S    03:29   0:15 postgres:
award Trucking [local] INSERT
(gdb) bt
#0  0x4db45ed6 in sigsuspend () from /lib/libc.so.6
#1  0x4de40218 in __pthread_wait_for_restart_signal () from
/lib/libpthread.so.0
#2  0x4de419a0 in __pthread_alt_lock () from /lib/libpthread.so.0
#3  0x4de3ec17 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x4dbf334c in vsyslog () from /lib/libc.so.6
#5  0x4dbf2ebf in syslog () from /lib/libc.so.6
#6  0x0814a49b in write_syslog (level=4,
    line=0xbfff70fc "ERROR:  plpython: Unable to create rexec.RExec
instance\nexceptions.RuntimeError: This code is not secure in Python
2.2 and 2.3") at elog.c:728
#7  0x08149bf9 in elog (lev=-1073779172, fmt=0x421a44fa "plpython:
%s\n%s") at elog.c:383
#8  0x4214b496 in PLy_elog (elevel=20, fmt=0x421a4a80 "Unable to create
rexec.RExec instance") at plpython.c:2811
#9  0x421490ff in PLy_procedure_compile (proc=0x83abdb8,
    src=0x8359610 "\n# TD[\"args\"][0] - audit_table (with schema) \n#
TD[\"relid\"] = OID of the table being triggered on\n\nif not
SD.has_key(\"plans\"):\n\tSD[\"plans\"] = {}\n\nif
SD[\"plans\"].has_key(TD[\"relid\"]):\n\tcachedPlans = SD"...)
    at plpython.c:1203
#10 0x42148f32 in PLy_procedure_create (fcinfo=0xbfff78c0, is_trigger=1
'\001', procTup=0x420b1910,
    key=0xbfff7730 "5210458_trigger") at plpython.c:1177
#11 0x42148ba8 in PLy_procedure_get (fcinfo=0xbfff78c0, is_trigger=1
'\001') at plpython.c:1043
#12 0x4214758e in plpython_call_handler (fcinfo=0xbfff78c0) at plpython.c:423
#13 0x0814b4cc in fmgr_security_definer (fcinfo=0xbfff78c0) at fmgr.c:681
#14 0x080bbb8f in ExecCallTriggerFunc (trigdata=0xbfff79e8,
finfo=0x82aff98, per_tuple_context=0x827c9d0) at trigger.c:1121
#15 0x080bc39a in DeferredTriggerExecute (event=0x82c2ba8, itemno=-4,
rel=0x4f815b, trigdesc=0x828bff8, finfo=0x82afcf0,
    per_tuple_context=0x827c9d0) at trigger.c:1656
#16 0x080bc685 in deferredTriggerInvokeEvents (immediate_only=1 '\001') at
trigger.c:1804
#17 0x080bc7fb in DeferredTriggerEndQuery () at trigger.c:1964
#18 0x0810a4c7 in finish_xact_command (forceCommit=0 '\0') at postgres.c:970
#19 0x0810a287 in pg_exec_query_string (query_string=0xbfff7aac,
dest=Remote, parse_context=0x827bb30) at postgres.c:897
#20 0x0810afdf in PostgresMain (argc=4, argv=0xbfff7cc8,
username=0x822e1d9 "award") at postgres.c:2013
#21 0x080f5175 in DoBackend (port=0x822e0a8) at postmaster.c:2310
#22 0x080f4d2f in BackendStartup (port=0x822e0a8) at postmaster.c:1932
#23 0x080f3f95 in ServerLoop () at postmaster.c:1009
#24 0x080f3989 in PostmasterMain (argc=1, argv=0x82162e0) at postmaster.c:788
#25 0x080d684c in main (argc=1, argv=0xbfff85c4) at main.c:210
#26 0x4db327a7 in __libc_start_main () from /lib/libc.so.6




Re: syslog enabled causes random hangs?

From:
Tom Lane
Date:
"Arthur Ward" <award@dominionsciences.com> writes:
> I'm back with more on the funky glibc-syslog-Postgres deadlocking behavior:

It looks to me like the guy doing VACUUM is simply waiting for the other
guy to release a page-level lock.  The other guy is running a deferred
trigger and so I'd expect him to be holding one or two page-level locks,
on the page or pages containing the tuple or tuples passed to the
trigger.  Nothing evidently wrong there.

The real question is why does vsyslog() have anything to block on, when
it's running in an unthreaded process?  Seeing that you are using
plpython, I wonder if Python is confusing matters somehow.

            regards, tom lane

Re: syslog enabled causes random hangs?

From:
"Arthur Ward"
Date:
> It looks to me like the guy doing VACUUM is simply waiting for the other
> guy to release a page-level lock.  The other guy is running a deferred
> trigger and so I'd expect him to be holding one or two page-level locks,
> on the page or pages containing the tuple or tuples passed to the
> trigger.  Nothing evidently wrong there.

If I remember what I was working on the other day when this whole thing
started, I think it was a single backend and a checkpoint that collided.
I'll trace that combination, assuming it happens again.

> The real question is why does vsyslog() have anything to block on, when
> it's running in an unthreaded process?  Seeing that you are using
> plpython, I wonder if Python is confusing matters somehow.

Oof. I'm using plpython all over the place; I don't think this problem has
happened anywhere that can easily work without it. :-/




Re: syslog enabled causes random hangs?

From:
Tom Lane
Date:
"Arthur Ward" <award@dominionsciences.com> writes:
>> It looks to me like the guy doing VACUUM is simply waiting for the other
>> guy to release a page-level lock.  The other guy is running a deferred
>> trigger and so I'd expect him to be holding one or two page-level locks,
>> on the page or pages containing the tuple or tuples passed to the
>> trigger.  Nothing evidently wrong there.

> If I remember what I was working on the other day when this whole thing
> started, I think it was a single backend and a checkpoint that collided.
> I'll trace that combination, assuming it happens again.

A checkpoint would also have reason to wait for a page-level lock, if
the stuck backend was holding one.  I am wondering though if the stuck
condition consistently happens while trying to fire a trigger?  That
would be very interesting ... not sure what it'd mean though ...

>> The real question is why does vsyslog() have anything to block on, when
>> it's running in an unthreaded process?  Seeing that you are using
>> plpython, I wonder if Python is confusing matters somehow.

> Oof. I'm using plpython all over the place; I don't think this problem has
> happened anywhere that can easily work without it. :-/

It looks to me like your plpython code is all dead in the water, seeing
that your Python installation is refusing creation of rexec.  (AFAIK the
only workaround is to downgrade Python to a version that allows rexec.)
If you're using it all over the place, how come you haven't noticed
that??

            regards, tom lane

Re: syslog enabled causes random hangs?

From:
"Arthur Ward"
Date:
> A checkpoint would also have reason to wait for a page-level lock, if
> the stuck backend was holding one.  I am wondering though if the stuck
> condition consistently happens while trying to fire a trigger?  That
> would be very interesting ... not sure what it'd mean though ...

Hmm. I'm really at a loss as to how I could test this theory. I don't have
a truly reproducible test case for any of these syslog problems, just one
set of code with really bad luck one afternoon. Race conditions (or things
that smell a lot like them) stink.

> It looks to me like your plpython code is all dead in the water, seeing
> that your Python installation is refusing creation of rexec.  (AFAIK the
> only workaround is to downgrade Python to a version that allows rexec.)
> If you're using it all over the place, how come you haven't noticed
> that??

I did notice, and it was an oversight. I had just rebuilt Python 2.2.3
(for unrelated reasons) and forgot to comment out the poison line in
rexec.py where they raise the "stop using RExec" exception. It behaves
properly once that line is commented out (properly meaning it works the
same as with earlier versions of Python; it appears all that changed was
the insertion of the exception at the beginning of the RExec constructor).
I tried to get the deadlock behavior again after fixing rexec.py, but my
luck wasn't bad enough for another three runs, so I posted the case I had
traces for.

An idea just popped into my head, though. Perhaps I can create procedures
in plpgsql and plpython which do nothing but spew notices (which would in
turn be sent to syslog), and run one or two copies to see if they'll die
on their own given sufficient time. That seems worthwhile, especially if I
can get a deadlock in plpgsql, since it will take the blame away from both
triggers and plpython. Does this sound like a reasonable experiment? I may
try setting this up on my home machine tonight to run for a few days.




Re: syslog enabled causes random hangs?

From:
Tom Lane
Date:
"Arthur Ward" <award@dominionsciences.com> writes:
> An idea just popped into my head, though. Perhaps I can create procedures
> in plpgsql and plpython which do nothing but spew notices (which would in
> turn be sent to syslog), and run one or two copies to see if they'll die
> on their own given sufficient time. That seems worthwhile, especially if I
> can get a deadlock in plpgsql, since it will take the blame away from both
> triggers and plpython. Does this sound like a reasonable experiment?

Sure.

Note it seems possible that the failure occurs only during a particular
backend's first or second attempt to send something to syslog (since the
first attempt would do openlog).  So you should probably run a test that
involves sending only a few messages per session, rather than vast
numbers of messages from one session.

            regards, tom lane