some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)

From: john gale
Subject: some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)
Date:
Msg-id: C8F8C964-F3C7-491A-9D88-5EF55BC00243@smadness.com
Responses: Re: some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)  (Greg Stark <stark@mit.edu>)
List: pgsql-bugs
We are running a Postgres 9.3.2 server on a Mac Pro running OS X 10.9.0.  As an automation results server it is heavily
inserted into (30-80 queries per second) and periodically selected from (once every few minutes).

We created this instance with Postgres 9.3.0 and after a time ran into an issue where some SELECTs that usually
completed in 2s were deadlocking and were still around hours or days later.  I read something that seemed related to this
issue in the release notes for 9.3.2, although I can't find it again now.  After upgrading to 9.3.2, the issue went away
for about a month.

Now we are seeing this again, and I am wondering if this is a Postgres 9.3.* issue or possibly a Mac OS X issue.

The symptoms:

Running 'ps waux' shows a slowly increasing number of postgres processes in SELECT.  Retrieving the actual SELECT from
pg_stat_activity and running it manually returns correct results within a few seconds.  The query_start time for
these stuck queries is hours back.  As an example:

db=# select pid, backend_start, query_start, state_change, waiting, state, query from pg_stat_activity where query like '%SELECT%' order by query_start;
  pid  |         backend_start         |          query_start          |         state_change          | waiting | state  | query
 22208 | 2014-02-13 13:47:11.846371+00 | 2014-02-13 13:47:11.852527+00 | 2014-02-13 13:47:11.852542+00 | f       | active | SELECT *, testruns.id FROM testruns JOIN test_types ON testruns.test_type_id=test_types.id WHERE testruns.custom_data->'SpawnID' = 'SpawnID-413873596.233222' AND testruns.started_at > '2014-02-11 04:53:16.233222'

You can see the query_start time of 5:47am localtime, and it's now 4:28pm, so the query has been in this state for
almost 11 hours.
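(For anyone trying to reproduce the diagnosis, this is a sketch of the kind of check I run to spot the stuck backends
and how long they've been running; the database name 'db' and the 'active'-state filter are just what applies here:)

    # list active queries oldest-first, with how long each has been running
    psql -d db -c "select pid, now() - query_start as running_for, waiting, state
                   from pg_stat_activity
                   where state = 'active'
                   order by query_start;"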

A sample (or spindump rather, which is more detailed) shows the postmaster process responding to this SELECT as waiting
in __sendto in libsystem_kernel.dylib:
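(Roughly how the trace below was captured; 22208 is the stuck backend's pid from the pg_stat_activity output above, and
the 10-second duration is just what I happened to use:)

    # user-space-only sample of the stuck backend
    sample 22208 10 -file /tmp/postmaster-22208-sample.txt
    # spindump also records kernel frames, which is what's shown below
    sudo spindump 22208 10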

Process:         postmaster [22208]
Path:            /Library/PostgreSQL/9.3/bin/postmaster
Architecture:    x86_64
Parent:          postmaster [1051]
Responsible:     postmaster [1051]
UID:             502
Task size:       7930 pages
Importance:      Donating
IO Policy:       Standard

  Thread 0x7b3205   DispatchQueue 1          priority 31
  1002 start + 52 (postmaster) [0x1000028c4]
    1002 main + 667 (postmaster) [0x100285acb]
      1002 PostmasterMain + 6970 (postmaster) [0x10033ca1a]
        1002 ServerLoop + 707 (postmaster) [0x10033d763]
          1002 BackendStartup + 461 (postmaster) [0x100341fcd]
            1002 BackendRun + 680 (postmaster) [0x100342b98]
              1002 PostgresMain + 2396 (postmaster) [0x1003bb41c]
                1002 exec_simple_query + 1382 (postmaster) [0x1003b57e6]
                  1002 PortalRun + 702 (postmaster) [0x1003bd95e]
                    1002 PortalRunSelect + 235 (postmaster) [0x1003bdd8b]
                      1002 ExecutorRun + 88 (postmaster) [0x100225a58]
                        1002 standard_ExecutorRun + 314 (postmaster) [0x100225b9a]
                          1002 ExecutePlan + 214 (postmaster) [0x1002281e6]
                            1002 printtup + 554 (postmaster) [0x1000090da]
                              1002 pq_endmessage + 58 (postmaster) [0x10028493a]
                                1002 pq_putmessage + 187 (postmaster) [0x100283d5b]
                                  1002 internal_putbytes + 54 (postmaster) [0x100283896]
                                    1002 internal_flush + 98 (postmaster) [0x100283a92]
                                      1002 __sendto + 10 (libsystem_kernel.dylib) [0x7fff8ac79a9a]
                                       *1002 hndl_unix_scall64 + 22 (mach_kernel) [0xffffff80002f3e06]
                                         *1002 unix_syscall64 + 499 (mach_kernel) [0xffffff800063de23]
                                           *1002 sendto_nocancel + 177 (mach_kernel) [0xffffff800061f531]
                                             *1002 ??? (mach_kernel + 4323232) [0xffffff800061f7a0]
                                               *1002 sosend + 634 (mach_kernel) [0xffffff80006139aa]
                                                 *1002 sosendcheck + 583 (mach_kernel) [0xffffff8000613647]
                                                   *1002 sbwait + 297 (mach_kernel) [0xffffff800061a939]
                                                     *1002 msleep + 114 (mach_kernel) [0xffffff80005e1302]
                                                       *1002 ??? (mach_kernel + 4067541) [0xffffff80005e10d5]
                                                         *1002 lck_mtx_sleep + 78 (mach_kernel) [0xffffff800022d17e]
                                                            *1002 thread_block_reason + 204 (mach_kernel) [0xffffff8000235d4c]
                                                              *1002 ??? (mach_kernel + 223563) [0xffffff800023694b]
                                                                *1002 machine_switch_context + 354 (mach_kernel) [0xffffff80002d81d2]

The client that called this SELECT is waiting around for network traffic:

Process:         Python [20954]
Path:            /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
Architecture:    x86_64
Parent:          Python [20951]
UID:             501
Task size:       12802 pages

  Thread 0x6ab05f   DispatchQueue 1          priority 31
  1001 start + 1 (libdyld.dylib) [0x7fff884287e1]
    1001 Py_Main + 2715 (Python) [0x10ace0e27]
      1001 PyRun_SimpleFileExFlags + 410 (Python) [0x10acbc726]
        1001 PyRun_FileExFlags + 165 (Python) [0x10acbcd3c]
          1001 ??? (Python + 339056) [0x10acbcc70]
            1001 PyEval_EvalCode + 54 (Python) [0x10ac809b3]
              1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                  1001 ??? (Python + 121055) [0x10ac878df]
                    1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                      1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                        1001 ??? (Python + 121055) [0x10ac878df]
                          1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                            1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                              1001 ??? (Python + 121055) [0x10ac878df]
                                1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                  1001 PyEval_EvalFrameEx + 3573 (Python) [0x10ac81f82]
                                    1001 PyObject_Call + 97 (Python) [0x10ac796c6]
                                      1001 ??? (Python + 331130) [0x10acbad7a]
                                        1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                          1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                                            1001 ??? (Python + 121055) [0x10ac878df]
                                              1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                                1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                                                  1001 ??? (Python + 121055) [0x10ac878df]
                                                    1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                                      1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
                                                        1001 ??? (Python + 121055) [0x10ac878df]
                                                          1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
                                                            1001 PyEval_EvalFrameEx + 9244 (Python) [0x10ac835a9]
                                                              1001 pg_query + 142 (_pg.so) [0x10b1e2efe]
                                                                1001 PQexecFinish + 41 (libpq.5.dylib) [0x10b1f7b02]
                                                                  1001 PQgetResult + 136 (libpq.5.dylib) [0x10b1f783d]
                                                                    1001 pqWaitTimed + 25 (libpq.5.dylib) [0x10b1f9e85]
                                                                      1001 poll + 10 (libsystem_kernel.dylib) [0x7fff8abe1f96]
                                                                       *1001 hndl_unix_scall64 + 19 (mach_kernel) [0xffffff80002cdd23]
                                                                         *1001 unix_syscall64 + 522 (mach_kernel) [0xffffff80005e063a]
                                                                           *1001 poll_nocancel + 798 (mach_kernel) [0xffffff80005778de]
                                                                             *1001 kqueue_scan + 187 (mach_kernel) [0xffffff800054e1fb]
                                                                               *1001 thread_block_reason + 300 (mach_kernel) [0xffffff800022da0c]
                                                                                 *1001 ??? (mach_kernel + 190273) [0xffffff800022e741]
                                                                                   *1001 machine_switch_context + 366 (mach_kernel) [0xffffff80002b403e]


The python pglib clients issuing these calls have been used for years, and we did not see this issue with Postgres
9.0.4, so I don't believe the clients are at fault.

We did not see this issue for years while running Postgres 9.0.4; however, we were also on an earlier Mac OS X version
at the time, so there is no single upgrade I can clearly blame.

I see no system.log messages at the time of this issue.  We have bumped the shared memory and open file limits on this
machine and seem to be well within them at the moment.  Other postgres traffic is churning along fine from the other
postmasters; it is only these growing few that seem stuck.
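(For completeness, roughly the checks behind that statement; the pid and the exact sysctl keys are just the ones
relevant on this box:)

    # current SysV shared memory settings
    sysctl kern.sysv.shmmax kern.sysv.shmall
    # open-file limits for the launchd session and the shell
    launchctl limit maxfiles
    ulimit -n
    # descriptors actually held by one of the stuck backends
    sudo lsof -p 22208 | wc -l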

Does this sound related to anything that was fixed, or attempted to be fixed, in 9.3.2?  Does the sample backtrace
look like it could be a Postgres issue, or could it be a Mac OS X / unix networking issue?
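(One more data point I can gather if it's useful: since the backend is parked under sosendcheck/sbwait, I'm guessing
its socket send buffer is full.  Assuming the clients connect over TCP on the default port 5432, something like this
should show whether that connection's Send-Q is backed up:)

    # show postgres connections with their Recv-Q / Send-Q byte counts
    netstat -an | grep 5432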

thanks for any help you can give.

    ~ john
