some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)

From: john gale
Date:
We are running a Postgres 9.3.2 server on a Mac Pro running OS X 10.9.0.  As an automation results server it's heavily inserted into (30-80 queries per sec) and periodically selected from (once every few minutes).

We created this instance with Postgres 9.3.0 and after a time ran into an issue where some SELECTs that usually completed in 2s were deadlocking, still around hours or days later.  I read something that seemed related to this issue under the release notes for 9.3.2, although I can't find it again now.  After upgrading to 9.3.2, the issue went away for a month.

Now we are seeing this again, and I am wondering if this is a Postgres 9.3.* issue or possibly a Mac OS X issue.

The symptoms:

Running 'ps waux' shows a slowly increasing number of postgres processes in SELECT.  Retrieving the actual SELECT from pg_stat_activity and running it manually returns correct results within a few seconds.  The query_start time for these stuck queries is hours back.  As an example:

db=# select pid, backend_start, query_start, state_change, waiting, state, query
     from pg_stat_activity where query like '%SELECT%' order by query_start;

  pid  |         backend_start         |          query_start          |         state_change          | waiting | state  | query
-------+-------------------------------+-------------------------------+-------------------------------+---------+--------+-------
 22208 | 2014-02-13 13:47:11.846371+00 | 2014-02-13 13:47:11.852527+00 | 2014-02-13 13:47:11.852542+00 | f       | active | SELECT *, testruns.id FROM testruns JOIN test_types ON testruns.test_type_id=test_types.id WHERE testruns.custom_data->'SpawnID' = 'SpawnID-413873596.233222' AND testruns.started_at > '2014-02-11 04:53:16.233222'

You can see the query_start time of 5:47am local time, and it's now 4:28pm, so the query has been in this state for almost 11 hours.
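
(As a cross-check, a minimal sketch against pg_locks; pg_stat_activity already reports waiting = f for these backends, and 22208 is the pid from the example above.  Zero rows here means the backend is not blocked on any heavyweight lock:)

-- hypothetical diagnostic query, pid taken from the example above
select locktype, relation::regclass, mode, granted
  from pg_locks
 where pid = 22208
   and not granted;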

A sample (or rather a spindump, which is more detailed) shows the postmaster process serving this SELECT waiting in __sendto in libsystem_kernel.dylib:

Process:         postmaster [22208]
Path:            /Library/PostgreSQL/9.3/bin/postmaster
Architecture:    x86_64
Parent:          postmaster [1051]
Responsible:     postmaster [1051]
UID:             502
Task size:       7930 pages
Importance:      Donating
IO Policy:       Standard

  Thread 0x7b3205   DispatchQueue 1          priority 31
  1002 start + 52 (postmaster) [0x1000028c4]
    1002 main + 667 (postmaster) [0x100285acb]
      1002 PostmasterMain + 6970 (postmaster) [0x10033ca1a]
        1002 ServerLoop + 707 (postmaster) [0x10033d763]
          1002 BackendStartup + 461 (postmaster) [0x100341fcd]
            1002 BackendRun + 680 (postmaster) [0x100342b98]
              1002 PostgresMain + 2396 (postmaster) [0x1003bb41c]
                1002 exec_simple_query + 1382 (postmaster) [0x1003b57e6]
                  1002 PortalRun + 702 (postmaster) [0x1003bd95e]
                    1002 PortalRunSelect + 235 (postmaster) [0x1003bdd8b]
                      1002 ExecutorRun + 88 (postmaster) [0x100225a58]
                        1002 standard_ExecutorRun + 314 (postmaster) [0x100225b9a]
                          1002 ExecutePlan + 214 (postmaster) [0x1002281e6]
                            1002 printtup + 554 (postmaster) [0x1000090da]
                              1002 pq_endmessage + 58 (postmaster) [0x10028493a]
                                1002 pq_putmessage + 187 (postmaster) [0x100283d5b]
                                  1002 internal_putbytes + 54 (postmaster) [0x100283896]
                                    1002 internal_flush + 98 (postmaster) [0x100283a92]
                                      1002 __sendto + 10 (libsystem_kernel.dylib) [0x7fff8ac79a9a]
                                       *1002 hndl_unix_scall64 + 22 (mach_kernel) [0xffffff80002f3e06]
                                         *1002 unix_syscall64 + 499 (mach_kernel) [0xffffff800063de23]
                                           *1002 sendto_nocancel + 177 (mach_kernel) [0xffffff800061f531]
                                             *1002 ??? (mach_kernel + 4323232) [0xffffff800061f7a0]
                                               *1002 sosend + 634 (mach_kernel) [0xffffff80006139aa]
                                                 *1002 sosendcheck + 583 (mach_kernel) [0xffffff8000613647]
                                                   *1002 sbwait + 297 (mach_kernel) [0xffffff800061a939]
                                                     *1002 msleep + 114 (mach_kernel) [0xffffff80005e1302]
                                                       *1002 ??? (mach_kernel + 4067541) [0xffffff80005e10d5]
                                                         *1002 lck_mtx_sleep + 78 (mach_kernel) [0xffffff800022d17e]
                                                            *1002 thread_block_reason + 204 (mach_kernel) [0xffffff8000235d4c]
                                                             *1002 ??? (mach_kernel + 223563) [0xffffff800023694b]
                                                                *1002 machine_switch_context + 354 (mach_kernel) [0xffffff80002d81d2]

The client that called this SELECT is waiting around for network traffic:

Process:         Python [20954]
Path:            /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
Architecture:    x86_64
Parent:          Python [20951]
UID:             501
Task size:       12802 pages

  Thread 0x6ab05f   DispatchQueue 1          priority 31
  1001 start + 1 (libdyld.dylib) [0x7fff884287e1]
    1001 Py_Main + 2715 (Python) [0x10ace0e27]
      1001 PyRun_SimpleFileExFlags + 410 (Python) [0x10acbc726]
        1001 PyRun_FileExFlags + 165 (Python) [0x10acbcd3c]
          1001 ??? (Python + 339056) [0x10acbcc70]
            1001 PyEval_EvalCode + 54 (Python) [0x10ac809b3]
              1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                  1001 ??? (Python + 121055) [0x10ac878df]
                    1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                      1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                        1001 ??? (Python + 121055) [0x10ac878df]
                          1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                            1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                              1001 ??? (Python + 121055) [0x10ac878df]
                                1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                  1001 PyEval_EvalFrameEx + 3573 (Python) [0x10ac81f82]
                                    1001 PyObject_Call + 97 (Python) [0x10ac796c6]
                                      1001 ??? (Python + 331130) [0x10acbad7a]
                                        1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                          1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                                            1001 ??? (Python + 121055) [0x10ac878df]
                                              1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                                1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                                                  1001 ??? (Python + 121055) [0x10ac878df]
                                                    1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                                      1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                                                        1001 ??? (Python + 121055) [0x10ac878df]
                                                          1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                                            1001 PyEval_EvalFrameEx + 9244 (Python) [0x10ac835a9]
                                                              1001 pg_query + 142 (_pg.so) [0x10b1e2efe]
                                                                1001 PQexecFinish + 41 (libpq.5.dylib) [0x10b1f7b02]
                                                                  1001 PQgetResult + 136 (libpq.5.dylib) [0x10b1f783d]
                                                                    1001 pqWaitTimed + 25 (libpq.5.dylib) [0x10b1f9e85]
                                                                      1001 poll + 10 (libsystem_kernel.dylib) [0x7fff8abe1f96]
                                                                       *1001 hndl_unix_scall64 + 19 (mach_kernel) [0xffffff80002cdd23]
                                                                         *1001 unix_syscall64 + 522 (mach_kernel) [0xffffff80005e063a]
                                                                           *1001 poll_nocancel + 798 (mach_kernel) [0xffffff80005778de]
                                                                             *1001 kqueue_scan + 187 (mach_kernel) [0xffffff800054e1fb]
                                                                               *1001 thread_block_reason + 300 (mach_kernel) [0xffffff800022da0c]
                                                                                 *1001 ??? (mach_kernel + 190273) [0xffffff800022e741]
                                                                                   *1001 machine_switch_context + 366 (mach_kernel) [0xffffff80002b403e]


The Python pglib clients issuing these calls have been used for years, and we did not see this issue while running Postgres 9.0.4, so I don't believe the clients are at fault.  However, we were also on an earlier Mac OS X version at that time, so there's no single upgrade I can clearly blame.

I see no system.log messages at the time of this issue.  We have bumped the shared memory and open file limits on this machine and seem to be well within them at the moment.  Other postgres traffic is churning along fine from the other postmasters; it is only these growing few that seem stuck.
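
(For reference, a hedged sketch of how the relevant server-side settings can be checked from psql; these are standard settings, and the OS-level shared memory and file-descriptor limits still have to be verified outside Postgres:)

-- standard settings; values are whatever the server was started with
SHOW shared_buffers;           -- shared memory reserved for the buffer cache
SHOW max_connections;          -- upper bound on concurrent backends
SHOW max_files_per_process;    -- per-backend cap on open file descriptors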

Does this sound related to anything that might have been fixed, or attempted to be fixed, in 9.3.2?  Does the sample backtrace look like it could be a Postgres issue, or could it be a Mac OS X / Unix networking issue?

thanks for any help you can give.

    ~ john

Re: some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)

From: Tom Lane
Date:
john gale <john@smadness.com> writes:
> We are running a Postgres 9.3.2 server on a Mac Pro running OS X 10.9.0.  As an automation results server it's heavily inserted into (30-80 queries per sec) and periodically selected from (once every few minutes).
> We created this instance with Postgres 9.3.0 and after a time ran into an issue where some SELECTs that usually completed in 2s were deadlocking, still around hours or days later.  I read something that seemed related to this issue under the release notes for 9.3.2, although I can't find it again now.  After upgrading to 9.3.2, the issue went away for a month.

I don't see any reason to think that's a deadlock.  The stack trace shows
the backend is waiting to send data to its client.  Since the client also
appears to be waiting for data to come in, it seems like this represents
some kind of network-stack bug.  You might try to get Apple interested.
Although the first thing they're gonna ask you is if 10.9.1 fixes it.

            regards, tom lane

Re: some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)

From: Greg Stark
Date:
On Fri, Feb 14, 2014 at 12:48 AM, john gale <john@smadness.com> wrote:
> Does this sound related to anything that might have been attempted to be
> fixed in 9.3.2 ?  Does the sample backtrace look like it could be a Postgres
> issue, or could it be a Mac OS X / unix network issue ?

It sounds like a network issue to me. Is there a stateful firewall or
NAT router between the client and the database? Is it possible it's
timing out mappings?

The traces indicate the database is blocked trying to write data to
the client and the client is waiting for more data from the server. To
avoid having connections pile up indefinitely, you could try enabling
TCP keepalives, which are ironically named because their main purpose
is to kill connections that have been stuck in a state like this for
too long. But that won't fix the underlying problem.
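
(For reference, these keepalives correspond to standard server-side settings; a hedged sketch with illustrative values only, settable per session or, with the same names, in postgresql.conf so they apply to every backend:)

-- illustrative values only; 0 means "use the operating system default"
SET tcp_keepalives_idle = 300;      -- seconds of idle before the first keepalive probe
SET tcp_keepalives_interval = 30;   -- seconds between unanswered probes
SET tcp_keepalives_count = 5;       -- unanswered probes before the connection is dropped

(libpq also accepts client-side equivalents as connection parameters: keepalives, keepalives_idle, keepalives_interval, keepalives_count; whether the pglib wrapper used here exposes a way to pass them is an assumption that would need checking.)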


--
greg