Discussion: Hanging queries and I/O exceptions
Hello,

While doing performance tests on Windows Server 2003 we observed the following two problems.

Environment: J2EE application running in the JBoss application server, against a pgsql 8.1 database. The load is caused by a smallish number of (very) complex transactions, typically about 5-10 running concurrently.

The first one, which bothers me the most, is that after about 6-8 hours the application stops processing. No errors are reported, either by the JDBC driver or by the server, but when I kill the application server, I see that all my connections hang in an SQL statement (which never seems to return):

2006-03-03 08:17:12 4504 6632560 LOG: duration: 45087000.000 ms statement: EXECUTE <unnamed> [PREPARE: SELECT objID FROM objects WHERE objID = $1 FOR UPDATE]

I think I can reliably reproduce this by loading the app and waiting a couple of hours.

The second problem is less predictable:

JDBC exception: An I/O error occured while sending to the backend.
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:214)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:430)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:346)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:250)

In my server log, I have:

2006-03-02 12:31:02 5692 6436342 LOG: could not receive data from client: A non-blocking socket operation could not be completed immediately.

At that time my box was fairly heavily loaded, but still responsive. The server and the JBoss appserver live on the same dual 2 GHz Opteron.

A quick Google search told me that:

1. More people have seen this.
2. No solutions.
3. The server message appears to indicate an unhandled WSAEWOULDBLOCK winsock error on recv(), which MSDN says is to be expected and should be retried.

Is this a known bug?
jan

--
--------------------------------------------------------------
Jan de Visser                     jdevisser@digitalfairway.com

                Baruk Khazad! Khazad ai-menu!
--------------------------------------------------------------
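Point 3 of the first message describes a "would block" result on a non-blocking socket, which is meant to be retried rather than reported as an error. As a rough illustration of that pattern only, and not the PostgreSQL server code, a minimal Python sketch might look like this. `recv_retrying` and its `timeout` parameter are hypothetical names; Python signals the would-block condition on a non-blocking `recv()` with `BlockingIOError`.

```python
import select
import socket


def recv_retrying(sock, nbytes, timeout=5.0):
    """Read from a non-blocking socket, treating "would block" as retryable.

    Hypothetical helper for illustration; not taken from any real driver.
    """
    while True:
        try:
            return sock.recv(nbytes)
        except BlockingIOError:
            # No data is available yet. Instead of failing, wait until the
            # socket becomes readable and then try the recv() again.
            readable, _, _ = select.select([sock], [], [], timeout)
            if not readable:
                raise TimeoutError("no data within %.1f s" % timeout)
```

Waiting in `select()` before retrying avoids spinning on the CPU while the peer is still busy, which matters on a heavily loaded box like the one described.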
I have more information on this issue.

First off, the problem now happens after about 1-2 hours, as opposed to the 6-8 I mentioned earlier. Yay for shorter test cycles.

Furthermore, it does not happen on Linux machines, both single CPU and dual CPU, nor on single CPU Windows machines. We can only reproduce it on a dual CPU Windows machine, and if we take one CPU out, it does not happen.

I executed the following after it hung:

db=# select l.pid, c.relname, l.mode, l.granted, l.page, l.tuple
     from pg_locks l, pg_class c
     where c.oid = l.relation
     order by l.pid;

Which showed me that several transactions were waiting for a particular row which was locked by another transaction. This transaction had no pending locks (so no deadlock), but just does not complete and hence never relinquishes the lock.

What gives? Has anybody ever heard of problems like this on dual CPU Windows machines?

jan

On Monday 06 March 2006 09:38, Jan de Visser wrote:
> Hello,
>
> While doing performance tests on Windows Server 2003 we observed the
> following two problems.
>
> Environment: J2EE application running in JBoss application server, against
> pgsql 8.1 database. Load is caused by a smallish number of (very) complex
> transactions, typically about 5-10 concurrently.
>
> The first one, which bothers me the most, is that after about 6-8 hours the
> application stops processing. No errors are reported, neither by the JDBC
> driver nor by the server, but when I kill the application server, I see
> that all my connections hang in a SQL statements (which never seem to
> return):
>
> 2006-03-03 08:17:12 4504 6632560 LOG: duration: 45087000.000 ms
> statement: EXECUTE <unnamed> [PREPARE: SELECT objID FROM objects WHERE
> objID = $1 FOR UPDATE]
>
> I think I can reliably reproduce this by loading the app, and waiting a
> couple of hours.
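The reasoning applied to the pg_locks output in this message (several backends waiting on one row, while the holder waits on nothing itself) can be sketched as a small Python check. This is a simplified illustration under stated assumptions: the `LockRow` type, the `stuck_candidates` function, and the sample pids and lock modes are all hypothetical, and the rows are assumed to have the columns selected by the query in the message. Because that query joins on `relation`, any wait that pg_locks records against a transaction ID rather than a relation would not appear in such rows.

```python
from collections import namedtuple

# One row of the query's result set; "tup" stands for the l.tuple column.
LockRow = namedtuple("LockRow", "pid relname mode granted page tup")


def stuck_candidates(rows):
    """Return pids that hold a contested lock but wait for nothing themselves."""
    # Backends with at least one ungranted (pending) lock request.
    waiting_pids = {r.pid for r in rows if not r.granted}
    # Lock targets that somebody is currently waiting for.
    contested = {(r.relname, r.page, r.tup) for r in rows if not r.granted}
    # Backends that hold a granted lock on one of those targets.
    holders = {
        r.pid
        for r in rows
        if r.granted and (r.relname, r.page, r.tup) in contested
    }
    # A holder that is itself waiting could be part of a deadlock cycle;
    # a holder that waits for nothing is simply not finishing.
    return sorted(holders - waiting_pids)


# Hypothetical sample data, not taken from the real system.
rows = [
    LockRow(101, "objects", "ExclusiveLock", True, 12, 7),
    LockRow(102, "objects", "ExclusiveLock", False, 12, 7),
    LockRow(103, "objects", "ExclusiveLock", False, 12, 7),
    LockRow(102, "objects", "RowShareLock", True, None, None),
]
print(stuck_candidates(rows))  # -> [101]
```

A pid returned by this check corresponds to the situation described in the message: it blocks others, has no pending locks, and therefore is not deadlocked in the usual sense.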
Jan de Visser <jdevisser@digitalfairway.com> writes:
> Furthermore, it does not happen on Linux machines, both single CPU and dual
> CPU, nor on single CPU Windows machines. We can only reproduce it on a dual CPU
> Windows machine, and if we take one CPU out, it does not happen.
> ...
> Which showed me that several transactions were waiting for a particular row
> which was locked by another transaction. This transaction had no pending
> locks (so no deadlock), but just does not complete and hence never
> relinquishes the lock.

Is the stuck transaction still consuming CPU time, or just stopped?

Is it possible to get a stack trace from the stuck process? I dunno if you've got anything gdb-equivalent under Windows, but that's the first thing I'd be interested in ...

			regards, tom lane
On Thursday 09 March 2006 15:10, Tom Lane wrote:
> Jan de Visser <jdevisser@digitalfairway.com> writes:
> > Furthermore, it does not happen on Linux machines, both single CPU and
> > dual CPU, nor on single CPU Windows machines. We can only reproduce it on a
> > dual CPU Windows machine, and if we take one CPU out, it does not happen.
> > ...
> > Which showed me that several transactions were waiting for a particular
> > row which was locked by another transaction. This transaction had no
> > pending locks (so no deadlock), but just does not complete and hence
> > never relinquishes the lock.
>
> Is the stuck transaction still consuming CPU time, or just stopped?

CPU drops off. In fact, that's my main clue something's wrong ;-)

> Is it possible to get a stack trace from the stuck process? I dunno
> if you've got anything gdb-equivalent under Windows, but that's the
> first thing I'd be interested in ...

I wouldn't know. I'm hardly a Windows expert. Prefer not to touch the stuff, myself. Can do some research though...

> regards, tom lane

jan
On Thursday 09 March 2006 15:10, Tom Lane wrote:
> Is it possible to get a stack trace from the stuck process? I dunno
> if you've got anything gdb-equivalent under Windows, but that's the
> first thing I'd be interested in ...

Here ya go:

http://www.devisser-siderius.com/stack1.jpg
http://www.devisser-siderius.com/stack2.jpg
http://www.devisser-siderius.com/stack3.jpg

There are three threads in the process. I guess thread 1 (stack1.jpg) is the most interesting.

I also noted that cranking up concurrency in my app reproduces the problem in about 4 minutes ;-)

With thanks to Magnus Hagander for the Process Explorer hint.

jan