Query hanging/not finishing inconsistently

From:
"Meetesh Karia"
Date:
Hi all,

We've recently started having a problem where a query that normally executes in ~15ms starts to take upwards of 20s to complete.  When the connection that ran the query is returned to the connection pool, it appears as though a transaction is still in progress so the connection pool tries to cancel the transaction and close the connection.  This fails and the connection is removed from the connection pool.  At this point, the situation rapidly degrades and we run out of connections to the postgres server.

An inspection of the pg_stat_activity table shows that practically every connection is running the above-mentioned query and some of those queries have been active for many minutes!  We've looked at the pg_locks table as well and the only exclusive locks are on transactions that are open.  All other locks are AccessShareLocks.  Also, as far as we can tell (from looking at the Hibernate stats), every db session that is opened is closed.
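
For reference, the inspection described above looks roughly like this (a minimal sketch against the 8.0 catalogs; it assumes stats_command_string is on so that current_query is populated):

    -- statements currently running, oldest first
    select procpid, usename, query_start, current_query
        from pg_stat_activity
        where current_query <> '<IDLE>'
        order by query_start;

    -- locks that have not been granted (in our case this returned nothing)
    select pid, locktype, mode, granted, relation::regclass as relation
        from pg_locks
        where not granted;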

When this happens, if I kill one of the running postgres processes (just by picking the last process returned from "ps -ef | grep postgres"), the other queries will immediately finish and the system will respond.  However, within 15 minutes, we'll be back in the same state as before.  At that point, I've cycled Apache, Tomcat and Postgres and the system then seems to come back.
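
For what it's worth, a gentler thing we could try next time, instead of killing an arbitrary process, is to send SIGINT to one specific stuck backend, since Postgres treats SIGINT as a cancel of the current query.  The pid below is a placeholder taken from pg_stat_activity.procpid; note that a backend blocked inside a lock wait may not act on the cancel until the lock is released.

    # cancel the current query in a single stuck backend (12345 is a placeholder pid)
    kill -INT 12345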

This problem appears to be unrelated to load; in fact, the majority of the time there is very little load on the site when this occurs.  We've run load tests on our dev boxes but we've been unable to reproduce the problem.  We're currently working on replaying the clicks on the site that preceded the weird state, and in the meantime we were wondering if anyone has experienced a problem like this or has any suggestions.

The query in question is:

select distinct s.screening_id, f.film_id, f.title, s.period_start, f.runtime, c.value, v.short_name, s.parent_id,
        stats.avg_rating, coalesce(stats.num_adds, 0) as num_adds, coalesce(stats.unique_visits, 0) as unique_visits,
        f.*, s.*
    from lte_screening s
        inner join lte_film f on s.film_id = f.film_id
        inner join lte_venue v on s.venue_id = v.venue_id
        inner join lte_film_classifier c on c.film_id = f.film_id
        left join lte_film_stats stats on stats.context = :context and stats.film_id = s.film_id
    where c.name = ? and s.period_start is not null and s.festival_id = ?
        and s.period_start between ? + ? and ? + ?
    order by s.period_start, f.title;

And the result of explain analyze:

QUERY PLAN
Unique  (cost=1117.42..1118.71 rows=11 width=866) (actual time=18.306..18.386 rows=15 loops=1)
  ->  Sort  (cost=1117.42..1117.44 rows=11 width=866) (actual time=18.300..18.316 rows=15 loops=1)
        Sort Key: s.period_start, f.title, s.screening_id, f.film_id, f.runtime, c.value, v.short_name, s.parent_id, stats.avg_rating, COALESCE(stats.num_adds, 0), COALESCE(stats.unique_visits, 0::bigint), f.film_id, f.sku, f.title, f."template", f.release_date, f.runtime, f."language", f.country, f.mpaa_rating, f.synopsis, f."owner", f.ext_sales_rank, f.small_image_url, f.medium_image_url, f.large_image_url, f.detail_page, f.to_delete, f.coalesce_to, (subplan), (subplan), s.screening_id, s.period_start, s.period_end, s.ticket_price, s.tickets_avail, s.tickets_sold, s."type", s.venue_id, s.festival_id, s.film_id, s.parent_id, s.ext_id, s.purchase_url, s.status, s.status_update_time
        ->  Nested Loop Left Join  (cost=2.62..1117.23 rows=11 width=866) (actual time=2.656..17.773 rows=15 loops=1)
              ->  Nested Loop  (cost=2.62..976.00 rows=11 width=846) (actual time=2.347..16.162 rows=15 loops=1)
                    ->  Hash Join  (cost=2.62..929.09 rows=10 width=831) (actual time=2.217..15.480 rows=15 loops=1)
                          Hash Cond: ("outer".venue_id = "inner".venue_id)
                          ->  Nested Loop  (cost=0.00..926.32 rows=10 width=818) (actual time=1.915..14.974 rows=15 loops=1)
                                ->  Seq Scan on lte_screening s  (cost=0.00..886.67 rows=10 width=159) (actual time=1.830..14.314 rows=15 loops=1)
                                      Filter: ((period_start IS NOT NULL) AND (festival_id = 316372) AND (period_start >= '2006-05-19 05:00:00'::timestamp without time zone) AND (period_start <= '2006-05-20 04:59:59'::timestamp without time zone))
                                ->  Index Scan using lte_film_pkey on lte_film f  (cost=0.00..3.95 rows=1 width=659) (actual time=0.026..0.028 rows=1 loops=15)
                                      Index Cond: ("outer".film_id = f.film_id)
                          ->  Hash  (cost=2.50..2.50 rows=50 width=21) (actual time=0.215..0.215 rows=0 loops=1)
                                ->  Seq Scan on lte_venue v  (cost=0.00..2.50 rows=50 width=21) (actual time=0.012..0.126 rows=52 loops=1)
                    ->  Index Scan using idx_classifier_film on lte_film_classifier c  (cost=0.00..4.67 rows=2 width=23) (actual time=0.026..0.028 rows=1 loops=15)
                          Index Cond: (c.film_id = "outer".film_id)
                          Filter: ((name)::text = 'FestivalCategory'::text)
              ->  Index Scan using lte_film_stats_pkey on lte_film_stats stats  (cost=0.00..4.34 rows=1 width=28) (actual time=0.034..0.037 rows=1 loops=15)
                    Index Cond: ((stats.context = 316372) AND (stats.film_id = "outer".film_id))
              SubPlan
                ->  Index Scan using idx_collateral_film on lte_film_collateral c  (cost=0.00..4.24 rows=1 width=40) (actual time=0.009..0.011 rows=1 loops=15)
                      Index Cond: (film_id = $0)
                      Filter: ((name)::text = 'TVRating'::text)
                ->  Index Scan using idx_collateral_film on lte_film_collateral c  (cost=0.00..4.24 rows=1 width=40) (actual time=0.022..0.025 rows=1 loops=15)
                      Index Cond: (film_id = $0)
                      Filter: ((name)::text = 'IMDBId'::text)
Total runtime: 19.077 ms


Here is our setup:

We have 2 machines.  The first is both the web server and the db server, and the second is just another web server:

Machine A
- 1 GB RAM
- 1 Intel(R) Xeon(TM) CPU 2.80GHz HyperThreaded Processor
- CentOS 4.3
- Linux moe 2.6.9-22.ELsmp #1 SMP Sat Oct 8 19:11:43 CDT 2005 i686 i686 i386 GNU/Linux

Machine B
- 1 GB RAM
- 1 Intel(R) Xeon(TM) CPU 2.80GHz Processor
- CentOS 4.3
- Linux larry 2.6.9-22.0.1.EL #1 Thu Oct 27 12:26:11 CDT 2005 i686 i686 i386 GNU/Linux

We're using the following software:
- Apache 2.0.52
- Tomcat 5.5.17
- Postgres 8.0.6
- JDK 1.5.0-Release 6
- Proxool 0.8.3
- Hibernate 3.1.3

Thanks in advance for any help,
Meetesh

Re: Query hanging/not finishing inconsistently

From:
"Craig A. James"
Date:
Meetesh Karia wrote:
> Hi all,
>
> We've recently started having a problem where a query that normally
> executes in ~15ms starts to take upwards of 20s to complete.  When the
> connection that ran the query is returned to the connection pool, it appears
> as though a transaction is still in progress so the connection pool
> tries to cancel the transaction and close the connection.  This fails
> and the connection is removed from the connection pool.  At this point,
> the situation rapidly degrades and we run out of connections to the
> postgres server.
>
> An inspection of the pg_stat_activity table shows that practically every
> connection is running the above-mentioned query and some of those
> queries have been active for many minutes!  We've looked at the pg_locks
> table as well and the only exclusive locks are on transactions that are
> open.  All other locks are AccessShareLocks.  Also, as far as we can
> tell (from looking at the Hibernate stats), every db session that is
> opened is closed.
>
> When this happens, if I kill one of the running postgres processes (just
> by picking the last process returned from "ps -ef | grep postgres"), the
> other queries will immediately finish and the system will respond.
> However, within 15 minutes, we'll be back in the same state as before.
> At that point, I've cycled Apache, Tomcat and Postgres and the system
> then seems to come back.

This sounds suspiciously like a question I asked a few weeks ago, on April 4.  I have a process that just gets stuck.  After some questions from various of the experts in this forum, I used gdb(1) to attach to one of the frozen Postgres backend processes, and here's what I found:

On 5/12/2006, I wrote:
> Thanks, good advice.  You're absolutely right, it's stuck on a
> mutex.  After doing what you suggest, I discovered that the query
> in progress is a user-written function (mine).  When I log in as
> root, and use "gdb -p <pid>" to attach to the process, here's
> what I find.  Notice the second function in the stack, a mutex
> lock:
>
> (gdb) bt
> #0  0x0087f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x0096cbfe in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #2  0x008ff67b in _L_mutex_lock_3220 () from /lib/tls/libc.so.6
> #3  0x4f5fc1b4 in ?? ()
> #4  0x00dc5e64 in std::string::_Rep::_S_empty_rep_storage () from /usr/local/pgsql/lib/libchmoogle.so
> #5  0x009ffcf0 in ?? () from /usr/lib/libz.so.1
> #6  0xbfe71c04 in ?? ()
> #7  0xbfe71e50 in ?? ()
> #8  0xbfe71b78 in ?? ()
> #9  0x009f7019 in zcfree () from /usr/lib/libz.so.1
> #10 0x009f7019 in zcfree () from /usr/lib/libz.so.1
> #11 0x009f8b7c in inflateEnd () from /usr/lib/libz.so.1
> #12 0x00c670a2 in ~basic_unzip_streambuf (this=0xbfe71be0) at zipstreamimpl.h:332
> #13 0x00c60b61 in OpenBabel::OBConversion::Read (this=0x1, pOb=0xbfd923b8, pin=0xffffffea) at istream:115
> #14 0x00c60fd8 in OpenBabel::OBConversion::ReadString (this=0x8672b50, pOb=0xbfd923b8) at obconversion.cpp:780
> #15 0x00c19d69 in chmoogle_ichem_mol_alloc () at stl_construct.h:120
> #16 0x00c1a203 in chmoogle_ichem_normalize_parent () at stl_construct.h:120
> #17 0x00c1b172 in chmoogle_normalize_parent_sdf () at vector.tcc:243
> #18 0x0810ae4d in ExecMakeFunctionResult ()
> #19 0x0810de2e in ExecProject ()
> #20 0x08115972 in ExecResult ()
> #21 0x08109e01 in ExecProcNode ()
> #22 0x00000020 in ?? ()
> #23 0xbed4b340 in ?? ()
> #24 0xbf92d9a0 in ?? ()
> #25 0xbed4b0c0 in ?? ()
> #26 0x00000000 in ?? ()
>
> It looks to me like my code is trying to read the input parameter
> (a fairly long string, maybe 2K) from a buffer that was gzip'ed
> by Postgres for the trip between the client and server... somewhere
> along the way, a mutex gets set, and then ... it's stuck forever.
>
> ps(1) shows that this thread had been running for about 7 hours,
> and the job status showed that this function had been
> successfully called about 1 million times, before this mutex lock
> occurred.

This is not an issue that's been resolved.  Nobody had ever seen this before.  Tom Lane suggested it might be a libc/c++ bug, but unfortunately in my case this lockup occurs so rarely (every few days) that it will be very difficult to know if we've fixed the problem.

If gdb(1) reveals that your process is stuck in a mutex, then you might have a better chance testing this hypothesis, since your problem happens within 15 minutes or so.

Did this start recently, perhaps right after a kernel update?

Craig

Re: Query hanging/not finishing inconsistently

From:
"Meetesh Karia"
Date:
Hi Craig,

Thanks for your response.  This did start recently and it wasn't after a kernel update, but it was after we moved the db from Machine B to Machine A (which have slightly different kernel versions).  However, the problem took about a week to show up after we moved from one machine to the other.  Unfortunately, the problem only reappears after 15 mins once it occurs the first time.  If it occurs again today I'll attach gdb to it and see whether it's stuck on a mutex.

Meetesh

On 5/22/06, Craig A. James <cjames@modgraph-usa.com> wrote:
Meetesh Karia wrote:
> Hi all,
>
> We've recently started having a problem where a query that normally
> executes in ~15ms starts to take upwards of 20s to complete.  When the
> connection that ran the query is returned to the connection pool, it appears
> as though a transaction is still in progress so the connection pool
> tries to cancel the transaction and close the connection.  This fails
> and the connection is removed from the connection pool.  At this point,
> the situation rapidly degrades and we run out of connections to the
> postgres server.
>
> An inspection of the pg_stat_activity table shows that practically every
> connection is running the above-mentioned query and some of those
> queries have been active for many minutes!  We've looked at the pg_locks
> table as well and the only exclusive locks are on transactions that are
> open.  All other locks are AccessShareLocks.  Also, as far as we can
> tell (from looking at the Hibernate stats), every db session that is
> opened is closed.
>
> When this happens, if I kill one of the running postgres processes (just
> by picking the last process returned from "ps -ef | grep postgres"), the
> other queries will immediately finish and the system will respond.
> However, within 15 minutes, we'll be back in the same state as before.
> At that point, I've cycled Apache, Tomcat and Postgres and the system
> then seems to come back.

This sounds suspiciously like a question I asked a few weeks ago, on April 4.  I have a process that just gets stuck.  After some questions from various of the experts in this forum, I used gdb(1) to attach to one of the frozen Postgres backend processes, and here's what I found:

On 5/12/2006, I wrote:
> Thanks, good advice.  You're absolutely right, it's stuck on a
> mutex.  After doing what you suggest, I discovered that the query
> in progress is a user-written function (mine).  When I log in as
> root, and use "gdb -p <pid>" to attach to the process, here's
> what I find.  Notice the second function in the stack, a mutex
> lock:
>
> (gdb) bt
> #0  0x0087f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x0096cbfe in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #2  0x008ff67b in _L_mutex_lock_3220 () from /lib/tls/libc.so.6
> #3  0x4f5fc1b4 in ?? ()
> #4  0x00dc5e64 in std::string::_Rep::_S_empty_rep_storage () from /usr/local/pgsql/lib/libchmoogle.so
> #5  0x009ffcf0 in ?? () from /usr/lib/libz.so.1
> #6  0xbfe71c04 in ?? ()
> #7  0xbfe71e50 in ?? ()
> #8  0xbfe71b78 in ?? ()
> #9  0x009f7019 in zcfree () from /usr/lib/libz.so.1
> #10 0x009f7019 in zcfree () from /usr/lib/libz.so.1
> #11 0x009f8b7c in inflateEnd () from /usr/lib/libz.so.1
> #12 0x00c670a2 in ~basic_unzip_streambuf (this=0xbfe71be0) at zipstreamimpl.h:332
> #13 0x00c60b61 in OpenBabel::OBConversion::Read (this=0x1, pOb=0xbfd923b8, pin=0xffffffea) at istream:115
> #14 0x00c60fd8 in OpenBabel::OBConversion::ReadString (this=0x8672b50, pOb=0xbfd923b8) at obconversion.cpp:780
> #15 0x00c19d69 in chmoogle_ichem_mol_alloc () at stl_construct.h:120
> #16 0x00c1a203 in chmoogle_ichem_normalize_parent () at stl_construct.h:120
> #17 0x00c1b172 in chmoogle_normalize_parent_sdf () at vector.tcc:243
> #18 0x0810ae4d in ExecMakeFunctionResult ()
> #19 0x0810de2e in ExecProject ()
> #20 0x08115972 in ExecResult ()
> #21 0x08109e01 in ExecProcNode ()
> #22 0x00000020 in ?? ()
> #23 0xbed4b340 in ?? ()
> #24 0xbf92d9a0 in ?? ()
> #25 0xbed4b0c0 in ?? ()
> #26 0x00000000 in ?? ()
>
> It looks to me like my code is trying to read the input parameter
> (a fairly long string, maybe 2K) from a buffer that was gzip'ed
> by Postgres for the trip between the client and server... somewhere
> along the way, a mutex gets set, and then ... it's stuck forever.
>
> ps(1) shows that this thread had been running for about 7 hours,
> and the job status showed that this function had been
> successfully called about 1 million times, before this mutex lock
> occurred.

This is not an issue that's been resolved.  Nobody had ever seen this before.  Tom Lane suggested it might be a libc/c++ bug, but unfortunately in my case this lockup occurs so rarely (every few days) that it will be very difficult to know if we've fixed the problem.

If gdb(1) reveals that your process is stuck in a mutex, then you might have a better chance testing this hypothesis, since your problem happens within 15 minutes or so.

Did this start recently, perhaps right after a kernel update?

Craig

Re: Query hanging/not finishing inconsistently

From:
"Meetesh Karia"
Date:
Hi all,

I just saw another email on the mailing list to this effect as well.  We recently updated the kernel versions on our machines to the latest stable versions (which contained both HyperThreading and IO bug fixes) and we updated Postgres to version 8.0.8.  We thought we were in the clear when we didn't encounter a hang for 6+ days.  But, once again we ran into the same situation where a query that normally executes in ~15ms wouldn't finish.  As before, there were no ungranted locks and threads weren't waiting on a lock.  I attached gdb to one of the stuck postgres processes and got the following stack trace:

#0  0x008967a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00977e5b in semop () from /lib/tls/libc.so.6
#2  0x08167298 in PGSemaphoreLock ()
#3  0x0818bcb5 in LWLockAcquire ()
#4  0x080a47f5 in SimpleLruWritePage ()
#5  0x080a48ad in SimpleLruReadPage ()
#6  0x080a519a in SubTransGetParent ()
#7  0x080a51f2 in SubTransGetTopmostTransaction ()
#8  0x0821371c in HeapTupleSatisfiesSnapshot ()
#9  0x080822a2 in heap_release_fetch ()
#10 0x080880fb in index_getnext ()
#11 0x08128507 in ExecReScanHashJoin ()
#12 0x08122a09 in ExecScan ()
#13 0x081287f9 in ExecIndexScan ()
#14 0x0811dfdd in ExecProcNode ()
#15 0x0812a49f in ExecNestLoop ()
#16 0x0811df9d in ExecProcNode ()
#17 0x0812b74d in ExecSort ()
#18 0x0811df5d in ExecProcNode ()
#19 0x0812b941 in ExecUnique ()
#20 0x0811df2c in ExecProcNode ()
#21 0x0811ce18 in ExecutorRun ()
#22 0x081947ec in PortalSetResultFormat ()
#23 0x08194df4 in PortalRun ()
#24 0x08192ef7 in PostgresMain ()
#25 0x08169780 in ClosePostmasterPorts ()
#26 0x0816b0ae in PostmasterMain ()
#27 0x0813a5a6 in main ()

We then upgraded glibc to 2.3.4-2.19 but we encountered the problem within a day.  Our latest attempt at isolating the problem has been to reboot the machine with a 'noht' kernel param.  The machine has been up for 1 day, 13:18 since then and we haven't seen the problem yet.
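
In case anyone wants to try the same workaround, the change is just the 'noht' parameter appended to the kernel line in grub.  Roughly like this (the kernel version and root device below are placeholders, not our exact setup):

    # /boot/grub/grub.conf (excerpt): 'noht' disables HyperThreading at boot
    title CentOS-4 (2.6.9-34.ELsmp)
            root (hd0,0)
            kernel /vmlinuz-2.6.9-34.ELsmp ro root=LABEL=/ noht
            initrd /initrd-2.6.9-34.ELsmp.img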

Has anyone been able to solve this problem?

Thanks,
Meetesh

On 5/22/06, Meetesh Karia <meetesh.karia@gmail.com> wrote:
Hi Craig,

Thanks for your response.  This did start recently and it wasn't after a kernel update, but it was after we moved the db from Machine B to Machine A (which have slightly different kernel versions).  However, the problem took about a week to show up after we moved from one machine to the other.  Unfortunately, the problem only reappears after 15 mins once it occurs the first time.  If it occurs again today I'll attach gdb to it and see whether it's stuck on a mutex.

Meetesh


On 5/22/06, Craig A. James <cjames@modgraph-usa.com> wrote:
Meetesh Karia wrote:
> Hi all,
>
> We've recently started having a problem where a query that normally
> executes in ~15ms starts to take upwards of 20s to complete.  When the
> connection that ran the query is returned to the connection pool, it appears
> as though a transaction is still in progress so the connection pool
> tries to cancel the transaction and close the connection.  This fails
> and the connection is removed from the connection pool.  At this point,
> the situation rapidly degrades and we run out of connections to the
> postgres server.
>
> An inspection of the pg_stat_activity table shows that practically every
> connection is running the above-mentioned query and some of those
> queries have been active for many minutes!  We've looked at the pg_locks
> table as well and the only exclusive locks are on transactions that are
> open.  All other locks are AccessShareLocks.  Also, as far as we can
> tell (from looking at the Hibernate stats), every db session that is
> opened is closed.
>
> When this happens, if I kill one of the running postgres processes (just
> by picking the last process returned from "ps -ef | grep postgres"), the
> other queries will immediately finish and the system will respond.
> However, within 15 minutes, we'll be back in the same state as before.
> At that point, I've cycled Apache, Tomcat and Postgres and the system
> then seems to come back.

This sounds suspiciously like a question I asked a few weeks ago, on April 4.  I have a process that just gets stuck.  After some questions from various of the experts in this forum, I used gdb(1) to attach to one of the frozen Postgres backend processes, and here's what I found:

On 5/12/2006, I wrote:
> Thanks, good advice.  You're absolutely right, it's stuck on a
> mutex.  After doing what you suggest, I discovered that the query
> in progress is a user-written function (mine).  When I log in as
> root, and use "gdb -p <pid>" to attach to the process, here's
> what I find.  Notice the second function in the stack, a mutex
> lock:
>
> (gdb) bt
> #0  0x0087f7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x0096cbfe in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #2  0x008ff67b in _L_mutex_lock_3220 () from /lib/tls/libc.so.6
> #3  0x4f5fc1b4 in ?? ()
> #4  0x00dc5e64 in std::string::_Rep::_S_empty_rep_storage () from /usr/local/pgsql/lib/libchmoogle.so
> #5  0x009ffcf0 in ?? () from /usr/lib/libz.so.1
> #6  0xbfe71c04 in ?? ()
> #7  0xbfe71e50 in ?? ()
> #8  0xbfe71b78 in ?? ()
> #9  0x009f7019 in zcfree () from /usr/lib/libz.so.1
> #10 0x009f7019 in zcfree () from /usr/lib/libz.so.1
> #11 0x009f8b7c in inflateEnd () from /usr/lib/libz.so.1
> #12 0x00c670a2 in ~basic_unzip_streambuf (this=0xbfe71be0) at zipstreamimpl.h:332
> #13 0x00c60b61 in OpenBabel::OBConversion::Read (this=0x1, pOb=0xbfd923b8, pin=0xffffffea) at istream:115
> #14 0x00c60fd8 in OpenBabel::OBConversion::ReadString (this=0x8672b50, pOb=0xbfd923b8) at obconversion.cpp:780
> #15 0x00c19d69 in chmoogle_ichem_mol_alloc () at stl_construct.h:120
> #16 0x00c1a203 in chmoogle_ichem_normalize_parent () at stl_construct.h:120
> #17 0x00c1b172 in chmoogle_normalize_parent_sdf () at vector.tcc:243
> #18 0x0810ae4d in ExecMakeFunctionResult ()
> #19 0x0810de2e in ExecProject ()
> #20 0x08115972 in ExecResult ()
> #21 0x08109e01 in ExecProcNode ()
> #22 0x00000020 in ?? ()
> #23 0xbed4b340 in ?? ()
> #24 0xbf92d9a0 in ?? ()
> #25 0xbed4b0c0 in ?? ()
> #26 0x00000000 in ?? ()
>
> It looks to me like my code is trying to read the input parameter
> (a fairly long string, maybe 2K) from a buffer that was gzip'ed
> by Postgres for the trip between the client and server... somewhere
> along the way, a mutex gets set, and then ... it's stuck forever.
>
> ps(1) shows that this thread had been running for about 7 hours,
> and the job status showed that this function had been
> successfully called about 1 million times, before this mutex lock
> occurred.

This is not an issue that's been resolved.  Nobody had ever seen this before.  Tom Lane suggested it might be a libc/c++ bug, but unfortunately in my case this lockup occurs so rarely (every few days) that it will be very difficult to know if we've fixed the problem.

If gdb(1) reveals that your process is stuck in a mutex, then you might have a better chance testing this hypothesis, since your problem happens within 15 minutes or so.

Did this start recently, perhaps right after a kernel update?

Craig


Re: Query hanging/not finishing inconsistently

From:
Tom Lane
Date:
"Meetesh Karia" <meetesh.karia@gmail.com> writes:
> ... But, once again we ran into the same
> situation where a query that normally executes in ~15ms wouldn't finish.  As
> before, there were no ungranted locks and threads weren't waiting on a
> lock.  I attached gdb to one of the stuck postgres processes and got the
> following stack trace:

> #0  0x008967a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x00977e5b in semop () from /lib/tls/libc.so.6
> #2  0x08167298 in PGSemaphoreLock ()
> #3  0x0818bcb5 in LWLockAcquire ()
> #4  0x080a47f5 in SimpleLruWritePage ()
> #5  0x080a48ad in SimpleLruReadPage ()
> #6  0x080a519a in SubTransGetParent ()
> #7  0x080a51f2 in SubTransGetTopmostTransaction ()
> #8  0x0821371c in HeapTupleSatisfiesSnapshot ()

What I'm wondering about is possible deadlock conditions inside slru.c.
There's no deadlock detection for LWLocks, so if it happened, the
processes involved would just freeze up.

If this happens again, would you collect stack traces from all the stuck
processes, not just one?
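
Something along these lines would grab a backtrace from every backend in one go (a rough sketch only; it assumes gdb is installed and that the backends run as the postgres user, and attaching will pause each process briefly):

    # collect a backtrace from every process owned by the postgres user
    echo bt > /tmp/gdb-bt.cmd
    for pid in $(ps -u postgres -o pid=); do
        echo "=== pid $pid ===" >> /tmp/pg-backtraces.txt
        gdb -batch -x /tmp/gdb-bt.cmd -p $pid >> /tmp/pg-backtraces.txt 2>&1
    done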

            regards, tom lane