Обсуждение: commit so slow program looks frozen

От:
"Carlo Stonebanks"
Дата:

(I tried this question on the interface forum and got no result, but I don't
know how to tell if it's an interface issue or not)

I have a TCL app which typically takes hours to complete. I found out that
it is taking longer than it should because it occasionally stalls
inexplicably (for tens of minute at a time) then usually continues.

There are a minimum of four apps running at the same time, all reading
different sections of the same table, all writing to the same db and the
same tables. The other apps seem unaffected by the one app that freezes.

This happens running "pg_exec $conn "commit" from within a TCL script on a
client app.


The delays are so long that I used to think the app was hopelessly frozen.
By accident, I left the app alone in its frozen state and came back a good
deal later and seen that it was running again.

Sometimes I decide it *IS* frozen and have to restart. Because Ctrl-C will
not cause the script to break, it appears the app is stuck in non-TCL code
(either waiting for postgres or stuck in the interface code?)

The application loops through an import file, reading one row at a time, and
issues a bunch of inserts and updates to various tables. There's a simple
pg_exec $conn "start transaction" at the beginning of the loop and the
commit at the end. The commit actually appears to be going through.

There are no messages of any significance in the log. There do not appear to
be any outstanding locks or transactions.

I am not doing any explicit locking, all transaction settings are set to
default.

Any thoughts on the cause and possible solutions would be appreciated.

Carlo



От:
Alvaro Herrera
Дата:

Carlo Stonebanks wrote:

> The delays are so long that I used to think the app was hopelessly frozen.
> By accident, I left the app alone in its frozen state and came back a good
> deal later and seen that it was running again.
>
> Sometimes I decide it *IS* frozen and have to restart. Because Ctrl-C will
> not cause the script to break, it appears the app is stuck in non-TCL code
> (either waiting for postgres or stuck in the interface code?)

You may try to figure out what's the process doing (the backend
obviously, not the frontend (Tcl) process) by attaching to it with
strace.  Is it doing system calls?  Maybe it's busy reading from or
writing to disk.  Maybe it's swamped by a context switch storm (but in
that case, probably the other processes would be affected as well).

Or you may want to attach to it with GDB and see what the backtrace
looks like.  If nothing obvious pops up, do it several times and compare
them.

I wouldn't expect it to be stuck on locks, because if it's only on
commit, then it probably has all the locks it needs.  But try to see if
you can find something not granted in pg_locks that it may be stuck on.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

От:
"Carlo Stonebanks"
Дата:

> You may try to figure out what's the process doing (the backend
> obviously, not the frontend (Tcl) process) by attaching to it with
> strace.

It's so sad when us poor Windows guys get helpful hints from people assume
that we're smart enough to run *NIX... ;-)

> Maybe it's swamped by a context switch storm (but in that case, probably
> the other processes would be affected as well).

What is a context switch storm? (and what a great name for a heavy metal
rock band!)

Interestingly enough, last night (after the original post) I watched three
of the processes slow down, one after the other - and then stall for so long
that I had assumed they had frozen. They were all stalled on a message that
I had put in the script that indicated they had never returned from a
commit. I have looked into this, and I believe the commits are actually
going through.

The remaining 4th process continued to run, and actually picked up speed as
the CPU gave its cycles over. The Windows task manager shows the postgresql
processes that (I assume) are associated with the stalled processes as
consuming zero CPU time.

Sometimes I have seen all of the apps slow down and momentarrily freeze at
the same time... but then continue. I have autovacuum off, although
stats_row_level and stats_start_collector remain on (I thought these were
only relevant if autovacuum was on).

I have seen the apps slow down (and perhaps stall) when specifical tables
have vacuum/analyze running, and that makes sense. I did notice that on one
occasion a "frozen" app came back to life after I shut down EMS PostgreSQL
manager in another session. Maybe a coincidence, or maybe an indication that
the apps are straining resources... on a box with two twin-core XEONs and
4GB of memory? Mind you, the config file is confgiured for the database
loading phase weare in now - with lots of resources devoted to a few
connections.

> I wouldn't expect it to be stuck on locks, because if it's only on
> commit, then it probably has all the locks it needs.  But try to see if
> you can find something not granted in pg_locks that it may be stuck on.

Looking at the pgadmin server status pages, no locks or transactions are
pending when this happens.

Carlo



От:
"Carlo Stonebanks"
Дата:

>> I have a question for you: did you have a long running query keeping open a transaction?  I've just noticed the same problem here, but things cleaned up immediately when I aborted the long-running transaction.

No, the only processes are from those in the import applications themselves: short transactions never lasting more than a fraction of a second.

 

Carlo

От:
Brian Hurt
Дата:

Carlo Stonebanks wrote:
You may try to figure out what's the process doing (the backend
obviously, not the frontend (Tcl) process) by attaching to it with
strace.   
It's so sad when us poor Windows guys get helpful hints from people assume 
that we're smart enough to run *NIX... ;-)
 
Maybe it's swamped by a context switch storm (but in that case, probably 
the other processes would be affected as well).   
What is a context switch storm? (and what a great name for a heavy metal 
rock band!)

Interestingly enough, last night (after the original post) I watched three 
of the processes slow down, one after the other - and then stall for so long 
that I had assumed they had frozen. They were all stalled on a message that 
I had put in the script that indicated they had never returned from a 
commit. I have looked into this, and I believe the commits are actually 
going through. 
I have a question for you: did you have a long running query keeping open a transaction?  I've just noticed the same problem here, but things cleaned up immediately when I aborted the long-running transaction.

Note that in my case the long-running transaction wasn't idle in transaction, it was just doing a whole lot of work.

Brian

От:
Scott Marlowe
Дата:

On Wed, 2006-10-25 at 15:07, Carlo Stonebanks wrote:
> > You may try to figure out what's the process doing (the backend
> > obviously, not the frontend (Tcl) process) by attaching to it with
> > strace.
>
> It's so sad when us poor Windows guys get helpful hints from people assume
> that we're smart enough to run *NIX... ;-)

You should try a google search on strace and NT or windows or XP...  I
was surprised how many various implementations of it I found.

>
> > Maybe it's swamped by a context switch storm (but in that case, probably
> > the other processes would be affected as well).
>
> What is a context switch storm? (and what a great name for a heavy metal
> rock band!)

I can just see the postgresql group getting together at the next
O'Reilley's conference and creating that band.  And it will all be your
fault.

A context switch storm is when your machine spends more time trying to
figure out what to do than actually doing anything.  The CPU spends most
it's time switching between programs than running them.


> I have seen the apps slow down (and perhaps stall) when specifical tables
> have vacuum/analyze running, and that makes sense. I did notice that on one
> occasion a "frozen" app came back to life after I shut down EMS PostgreSQL
> manager in another session. Maybe a coincidence, or maybe an indication that
> the apps are straining resources... on a box with two twin-core XEONs and
> 4GB of memory? Mind you, the config file is confgiured for the database
> loading phase weare in now - with lots of resources devoted to a few
> connections.

Seeing as PostgreSQL runs one thread / process per connection, it's
pretty unlikely that the problem here is one "hungry" thread.  Do all
four CPUs show busy, or just one?  Do you have a way of measuring how
much time is spent waiting on I/O on a windows machine like top / vmstat
does in unix?

Is it possible your machine is going into a swap storm?  i.e. you've
used all physical memory somehow and it's swapping out?  If your current
configuration is too aggresive on sort / work mem then it can happen
with only a few connections.

Note that if you have an import process that needs a big chunk of
memory, you can set just that one connection to use a large setting and
leave the default smaller.

От:
"Joshua D. Drake"
Дата:

>
>>> Maybe it's swamped by a context switch storm (but in that case, probably
>>> the other processes would be affected as well).
>> What is a context switch storm? (and what a great name for a heavy metal
>> rock band!)
>
> I can just see the postgresql group getting together at the next
> O'Reilley's conference and creating that band.  And it will all be your
> fault.

Well now you let the secret out!

Joshua D. Drake

--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


От:
"Jim C. Nasby"
Дата:

On Wed, Oct 25, 2006 at 04:32:16PM -0400, Carlo Stonebanks wrote:
> >> I have a question for you: did you have a long running query keeping open
> a transaction?  I've just noticed the same problem here, but things cleaned
> up immediately when I aborted the long-running transaction.
>
> No, the only processes are from those in the import applications themselves:
> short transactions never lasting more than a fraction of a second.

Do you have a linux/unix machine you could reproduce this on?
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
"Magnus Hagander"
Дата:

> > > You may try to figure out what's the process doing (the backend
> > > obviously, not the frontend (Tcl) process) by attaching
> to it with
> > > strace.
> >
> > It's so sad when us poor Windows guys get helpful hints from people
> > assume that we're smart enough to run *NIX... ;-)
>
> You should try a google search on strace and NT or windows or
> XP...  I was surprised how many various implementations of it I found.

Let me know if you find one that's stable, I've been wanting that. I've
tried one or two, but it's always been just a matter of time before the
inevitable BSOD.

> > > Maybe it's swamped by a context switch storm (but in that case,
> > > probably the other processes would be affected as well).
> >
> > What is a context switch storm? (and what a great name for a heavy
> > metal rock band!)
>
> I can just see the postgresql group getting together at the
> next O'Reilley's conference and creating that band.  And it
> will all be your fault.

*DO NOT LET DEVRIM SEE THIS THREAD*


> A context switch storm is when your machine spends more time
> trying to figure out what to do than actually doing anything.
>  The CPU spends most it's time switching between programs
> than running them.

I can see Windows benig more sucepitble to this than say Linux, because
switching between processes there is a lot more expensive than on Linux.

> Seeing as PostgreSQL runs one thread / process per
> connection, it's pretty unlikely that the problem here is one
> "hungry" thread.  Do all four CPUs show busy, or just one?
> Do you have a way of measuring how much time is spent waiting
> on I/O on a windows machine like top / vmstat does in unix?

There are plenty of counters in the Performance Monitor. Specificall,
look at "disk queue counters" - they indicate when the I/O subsystem is
backed up.


//Magnus

От:
"Rocco Altier"
Дата:

I seem to remember Oleg/Teodor recently reporting a problem with Windows
hanging on a multi-processor machine, during a heavy load operation.

In their case it seemed like a vacuum would allow it to wake up.  They
did commit a patch that did not make it into the last minor version for
lack of testing.

Perhaps you could see if that patch might work for you, which would also
help ease the argument against the patches lack of testing.

    -rocco

От:
"Carlo Stonebanks"
Дата:

> I can just see the postgresql group getting together at the next
> O'Reilley's conference and creating that band.  And it will all be your
> fault.

Finally, a chance for me to wear my black leather pants.

> A context switch storm is when your machine spends more time trying to
> figure out what to do than actually doing anything.  The CPU spends most
> it's time switching between programs than running them.

Is thatl likely on a new 4 CPU server that has no clients connected and that
is only running four (admittedly heavy) TCL data load scripts?

> Seeing as PostgreSQL runs one thread / process per connection, it's
> pretty unlikely that the problem here is one "hungry" thread.  Do all
> four CPUs show busy, or just one?  Do you have a way of measuring how
> much time is spent waiting on I/O on a windows machine like top / vmstat
> does in unix?

Before optimising the queries, all four CPU's were pinned to max performance
(that's why I only run four imports at a time). After opimisation, all four
CPU's are busy, but usage is spikey (which looks more normal), but all are
obviously busy. I have this feeling that when an import app freezes, one CPU
goes idle while the others stay busy - I will confirm that with the next
import operation.

I suspect that the server has the Xeon processors that were of a generation
which PostgreSQL had a problem with - should a postgresql process be able to
distrivute its processing load across CPU's? (i.e. When I see one CPU at
100% while all others are idle?)

> Note that if you have an import process that needs a big chunk of
> memory, you can set just that one connection to use a large setting and
> leave the default smaller.

Total memory usage is below the max available. Each postgresql process takes
up 500MB, there are four running and I have 4GB of RAM.

Carlo



От:
"Carlo Stonebanks"
Дата:

This is pretty interesting - where can I read more on this? Windows isn't
actually hanging, one single command line window is - from its behaviour, it
looks like the TCL postgresql package is waiting for pg_exec to come back
from the commit (I believe the commit has actually gone through).

It could even be that there's something wrong with the TCL package, but from
my understanding it is one of the most complete interfaces out there - which
is weird, because TCL seems to be the most unpopular language in the
community.

Caro


""Rocco Altier"" <> wrote in message
news:...
>I seem to remember Oleg/Teodor recently reporting a problem with Windows
> hanging on a multi-processor machine, during a heavy load operation.
>
> In their case it seemed like a vacuum would allow it to wake up.  They
> did commit a patch that did not make it into the last minor version for
> lack of testing.
>
> Perhaps you could see if that patch might work for you, which would also
> help ease the argument against the patches lack of testing.
>
> -rocco
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>



От:
"Merlin Moncure"
Дата:

On 10/26/06, Carlo Stonebanks <> wrote:
> This is pretty interesting - where can I read more on this? Windows isn't
> actually hanging, one single command line window is - from its behaviour, it
> looks like the TCL postgresql package is waiting for pg_exec to come back
> from the commit (I believe the commit has actually gone through).
>
> It could even be that there's something wrong with the TCL package, but from
> my understanding it is one of the most complete interfaces out there - which
> is weird, because TCL seems to be the most unpopular language in the
> community.

when it happens, make sure to query pg_locks and see what is going on
there lock issues are not supposed to manifest on a commit, which
releases locks, but you never know.  There have been reports of
insonsistent lock ups on windows (espeically multi-processor) which
you might be experiencing. Make sure you have the very latest version
of pg 8.1.x.  Also consider checking out 8.2 and see if you can
reproduce the behavior there...this will require compiling postgresql.

merlin

От:
Richard Troy
Дата:

> A context switch storm is when your machine spends more time trying to
> figure out what to do than actually doing anything.  The CPU spends most
> it's time switching between programs than running them.

Well, we usually use the term "thrashing" as the generic for when your
machine is spending more time on overhead than doing user work - this
would include paging or context switching, along with whatever else. A
context-switch storm would be a specific form of thrashing!

Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
, http://ScienceTools.com/


От:
Richard Troy
Дата:


On Thu, 26 Oct 2006, Carlo Stonebanks wrote:
>
> It could even be that there's something wrong with the TCL package, but from
> my understanding it is one of the most complete interfaces out there - which
> is weird, because TCL seems to be the most unpopular language in the
> community.
>

Not that this matters much and it's slightly off the topic of performance,
but...

...I would have to check my _ancient_ emails for the name of the guy and
the dates, but the integration was first done while I was a researcher at
Berkeley, at the tail end of the Postgres team's funding. My team used
Postgres with TCL internals to implement "the query from hell" inside the
server. That was about 1994 or '95, IIRC. At that time, most people who
knew both said that they were roughly equivalent, with PERL being _vastly_
less intelligible (to humans) and they hated it. What happened was PERL
got exposure that TCL didn't and people who didn't know better jumped on
it.

So, it was one of the most complete interfaces because it was done first,
or nearly first, by the original guys that created the original Postgres.

Richard


--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
, http://ScienceTools.com/


От:
"Carlo Stonebanks"
Дата:

> when it happens, make sure to query pg_locks and see what is going on
> there lock issues are not supposed to manifest on a commit, which
> releases locks, but you never know.

There aren't any pedning locks (assuming that pgAdmin is using pg_locks to
display pendin glocks).

> There have been reports of
> insonsistent lock ups on windows (espeically multi-processor) which
> you might be experiencing. Make sure you have the very latest version
> of pg 8.1.x.  Also consider checking out 8.2 and see if you can
> reproduce the behavior there...this will require compiling postgresql.

Are these associated with any type of CPU?

Carlo



От:
mark@mark.mielke.cc
Дата:

Perl started out fast - TCL started out slow. Perl used syntax that,
although it would drive some people crazy, followed a linguistic curve
that Larry Wall claimed was healthy. The English language is crazy,
and yet, it has become standard world wide as well. Designed, regular
languages like Esperanto have not received much support either.

Perl is designed to be practical. TCL was designed to be minimalistic.

Perl uses common idioms for UNIX programmers. // for regular expressions,
$VAR for variables, Many of the statement are familiar for C programmers.
++ for increment (compare against 'incr abc' for TCL). $a=5 for assignment,
compare against 'set abc 5' in TCL.

TCL tries to have a reduced syntax, where 'everything is a string'
which requires wierdness for people.  For example, newline is
end-of-line, so { must be positioned correctly. Code is a string, so
in some cases you need to escape code, otherwise not.

Perl has object oriented support built-in. It's ugly, but it works.
TCL has a questionable '[incr tcl]' package.

Perl has a wealth of modules on CPAN to do almost anything you need to.
TCL has the beginning of one (not as rich), but comes built-in with things
like event loops, and graphicals (Tk).

I could go on and on - but I won't, because this is the PostgreSQL
mailing list. People either get Perl, or TCL, or they don't. More
people 'get' Perl, because it was marketted better, it's syntax is
deceivingly comparable to other well known languages, and for the
longest time, it was much faster than TCL to write (especially when
using regular expressions) and faster to run.

Did TCL get treated unfairly as a result? It's a language. Who cares! :-)

Cheers,
mark

--
 /  /      __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


От:
"Joshua D. Drake"
Дата:

> Perl has a wealth of modules on CPAN to do almost anything you need to.
> TCL has the beginning of one (not as rich), but comes built-in with things
> like event loops, and graphicals (Tk).
>
> I could go on and on - but I won't, because this is the PostgreSQL
> mailing list. People either get Perl, or TCL, or they don't. More
> people 'get' Perl, because it was marketted better, it's syntax is
> deceivingly comparable to other well known languages, and for the
> longest time, it was much faster than TCL to write (especially when
> using regular expressions) and faster to run.
>
> Did TCL get treated unfairly as a result? It's a language. Who cares! :-)

You forgot the god of scripting languages, Python... (Yes perl is much
better at system level scripting than Python).

Sincerely,

Joshua D. Drake

>
> Cheers,
> mark
>


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


От:
"Simon Riggs"
Дата:

On Thu, 2006-10-26 at 11:06 -0400, Merlin Moncure wrote:
> On 10/26/06, Carlo Stonebanks <> wrote:
> > This is pretty interesting - where can I read more on this? Windows isn't
> > actually hanging, one single command line window is - from its behaviour, it
> > looks like the TCL postgresql package is waiting for pg_exec to come back
> > from the commit (I believe the commit has actually gone through).
> >
> > It could even be that there's something wrong with the TCL package, but from
> > my understanding it is one of the most complete interfaces out there - which
> > is weird, because TCL seems to be the most unpopular language in the
> > community.
>
> when it happens, make sure to query pg_locks and see what is going on
> there lock issues are not supposed to manifest on a commit, which
> releases locks, but you never know.  There have been reports of
> insonsistent lock ups on windows (espeically multi-processor) which
> you might be experiencing. Make sure you have the very latest version
> of pg 8.1.x.  Also consider checking out 8.2 and see if you can
> reproduce the behavior there...this will require compiling postgresql.

Merlin,

Rumour has it you managed to get a BT from Windows. That sounds like it
would be very useful here.

Carlo,

Many things can happen at commit time. Temp tables dropped, TRUNCATEd
old relations unlinked, init files removed, deferred foreign key checks
(and subsequent cascading), dropped tables flushed. The assumption that
COMMIT is a short request may not be correct according to the wide range
of tasks that could occur according to standard SQL:2003 behaviour.

Some of those effects take longer on larger systems. Any and all of
those things have potential secondary effects, all of which can also
conflict with other user tasks and especially with a CHECKPOINT. Then
there's various forms of contention caused by misconfiguration.

I do think we need some better instrumentation for this kind of thing.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



От:
"Carlo Stonebanks"
Дата:

>Ben Trewern" <> wrote in message
>news:ehsotr$20vp$...
> It might be worth turning off hyperthreading if your Xeons are using it.
> There have been reports of this causing inconsistent behaviour with
> PostgreSQL.

Yes, this issue comes up often - I wonder if the Woodcrest Xeons resolved
this? Have these problems been experienced on both Linux and Windows (we are
running Windows 2003 x64)

Carlo




От:
"Carlo Stonebanks"
Дата:

> I do think we need some better instrumentation for this kind of thing.

Well, one thing's for sure - I have little other information to offer. The
problem is that the lockups occur after hours of operation and thousands of
rows being digested (which is the nature of the program). If "better
instrumentation" implies tools to inpsect the sate of the db server's
process and to know what it's waiting for from the OS, I agree.

Then again, I can't even tell you whether the postgres process is at fault
or the TCL interface - which would be odd, because it's one fo the most
mature interfaces postgres has. So, here's a thought: is there any way for
me to inspect the state of a postgres process to see if it's responsive -
even if it's serving another connection?

Carlo



От:
"Merlin Moncure"
Дата:

On 10/28/06, Simon Riggs <> wrote:
> On Thu, 2006-10-26 at 11:06 -0400, Merlin Moncure wrote:
> > On 10/26/06, Carlo Stonebanks <> wrote:
> > > This is pretty interesting - where can I read more on this? Windows isn't
> > > actually hanging, one single command line window is - from its behaviour, it
> > > looks like the TCL postgresql package is waiting for pg_exec to come back
> > > from the commit (I believe the commit has actually gone through).
> > >
> > > It could even be that there's something wrong with the TCL package, but from
> > > my understanding it is one of the most complete interfaces out there - which
> > > is weird, because TCL seems to be the most unpopular language in the
> > > community.
> >
> > when it happens, make sure to query pg_locks and see what is going on
> > there lock issues are not supposed to manifest on a commit, which
> > releases locks, but you never know.  There have been reports of
> > insonsistent lock ups on windows (espeically multi-processor) which
> > you might be experiencing. Make sure you have the very latest version
> > of pg 8.1.x.  Also consider checking out 8.2 and see if you can
> > reproduce the behavior there...this will require compiling postgresql.
>
> Merlin,
>
> Rumour has it you managed to get a BT from Windows. That sounds like it
> would be very useful here.
>
> Carlo,
>
> Many things can happen at commit time. Temp tables dropped, TRUNCATEd
> old relations unlinked, init files removed, deferred foreign key checks
> (and subsequent cascading), dropped tables flushed. The assumption that
> COMMIT is a short request may not be correct according to the wide range
> of tasks that could occur according to standard SQL:2003 behaviour.
>
> Some of those effects take longer on larger systems. Any and all of
> those things have potential secondary effects, all of which can also
> conflict with other user tasks and especially with a CHECKPOINT. Then
> there's various forms of contention caused by misconfiguration.
>
> I do think we need some better instrumentation for this kind of thing.
>
> --
>   Simon Riggs
>   EnterpriseDB   http://www.enterprisedb.com

start here:
http://beta.linuxports.com/pgsql-hackers-win32/2005-08/msg00051.php

merlin

От:
"Bucky Jordan"
Дата:

>
> Yes, this issue comes up often - I wonder if the Woodcrest Xeons
resolved
> this? Have these problems been experienced on both Linux and Windows
(we
> are
> running Windows 2003 x64)
>
> Carlo
>
IIRC Woodcrest doesn't have HT, just dual core with shared cache.

- Bucky

От:
Rob Lemley
Дата:

Merlin Moncure wrote:
> On 10/28/06, Simon Riggs <> wrote:
>> On Thu, 2006-10-26 at 11:06 -0400, Merlin Moncure wrote:
>> > On 10/26/06, Carlo Stonebanks <> wrote:
>> > > This is pretty interesting - where can I read more on this?
>> Windows isn't
>> > > actually hanging, one single command line window is - from its
>> behaviour, it
>> > > looks like the TCL postgresql package is waiting for pg_exec to
>> come back
>> > > from the commit (I believe the commit has actually gone through).
>> > >
>> > > It could even be that there's something wrong with the TCL
>> package, but from
>> > > my understanding it is one of the most complete interfaces out
>> there - which
>> > > is weird, because TCL seems to be the most unpopular language in the
>> > > community.
>> >
>> > when it happens, make sure to query pg_locks and see what is going on
>> > there lock issues are not supposed to manifest on a commit, which
>> > releases locks, but you never know.  There have been reports of
>> > insonsistent lock ups on windows (espeically multi-processor) which
>> > you might be experiencing. Make sure you have the very latest version
>> > of pg 8.1.x.  Also consider checking out 8.2 and see if you can
>> > reproduce the behavior there...this will require compiling postgresql.
>>
>> Merlin,
>>
>> Rumour has it you managed to get a BT from Windows. That sounds like it
>> would be very useful here.

Could it be there is a hangup in communication with the backend via the
libpq library?

I have a situation on Windows where psql seems to be hanging randomly
AFTER completing (or almost completing) a vacuum full analyze verbose.

I'm running the same databases on a single postgres instance on a Dell
4gb RAM 2 processor xeon (hyper-threading turned off) running Debian
GNU/Linux.  The windows system is an IBM 24gb RAM, 4 processor xeon
(hyperthreading turned off).  No problems on the Dell, it runs pgbench
faster than the windows IBM system.  The Dell Linux system zips through
vacuumdb --all --analyze --full --verbose with no problems.  The windows
machine is running 6 instances of postgresql because of problems trying
to load all of the databases into one instance on windows.

The last output from psql is:

INFO:  free space map contains 474 pages in 163 relations
DETAIL:  A total of 2864 page slots are in use (including overhead).
2864 page slots are required to track all free space.
Current limits are:  420000 page slots, 25000 relations, using 4154 KB.

(I've currently restarted postgresql with more reasonable fsm_page_slots
and fsm_relations).

It appears that psql is hung in the call to WS2_32!select.
The psql stack trace looks like this:

ntdll!KiFastSystemCallRet
ntdll!NtWaitForSingleObject+0xc
mswsock!SockWaitForSingleObject+0x19d
mswsock!WSPSelect+0x380
WS2_32!select+0xb9
WARNING: Stack unwind information not available. Following frames may be
wrong.
libpq!PQenv2encoding+0x1fb
libpq!PQenv2encoding+0x3a1
libpq!PQenv2encoding+0x408
libpq!PQgetResult+0x58
libpq!PQgetResult+0x188
psql+0x4c0f
psql+0x954d
psql+0x11e7
psql+0x1238
kernel32!IsProcessorFeaturePresent+0x9e

With more detail:

 # ChildEBP RetAddr  Args to Child
00 0022f768 7c822124 71b23a09 000007a8 00000001
ntdll!KiFastSystemCallRet (FPO: [0,0,0])
01 0022f76c 71b23a09 000007a8 00000001 0022f794
ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])
02 0022f7a8 71b23a52 000007a8 00000780 00000000
mswsock!SockWaitForSingleObject+0x19d (FPO: [Non-Fpo])
03 0022f898 71c0470c 00000781 0022fc40 0022fb30 mswsock!WSPSelect+0x380
(FPO: [Non-Fpo])
04 0022f8e8 6310830b 00000781 0022fc40 0022fb30 WS2_32!select+0xb9 (FPO:
[Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be
wrong.
05 0022fd68 631084b1 00000000 ffffffff 0000001d libpq!PQenv2encoding+0x1fb
06 0022fd88 63108518 00000001 00000000 00614e70 libpq!PQenv2encoding+0x3a1
07 0022fda8 631060f8 00000001 00000000 00614e70 libpq!PQenv2encoding+0x408
08 0022fdc8 63106228 00614e70 00613a71 00615188 libpq!PQgetResult+0x58
09 0022fde8 00404c0f 00614e70 00613a71 0041ac7a libpq!PQgetResult+0x188
0a 0022fe98 0040954d 00613a71 00423180 00423185 psql+0x4c0f
0b 0022ff78 004011e7 00000006 00613b08 00612aa8 psql+0x954d
0c 0022ffb0 00401238 00000001 00000009 0022fff0 psql+0x11e7
0d 0022ffc0 77e523e5 00000000 00000000 7ffdc000 psql+0x1238
0e 0022fff0 00000000 00401220 00000000 78746341
kernel32!IsProcessorFeaturePresent+0x9e

the pg_locks table:
-[ RECORD 1 ]-+----------------
locktype      | relation
database      | 19553
relation      | 10342
page          |
tuple         |
transactionid |
classid       |
objid         |
objsubid      |
transaction   | 1998424
pid           | 576
mode          | AccessShareLock
granted       | t
-[ RECORD 2 ]-+----------------
locktype      | transactionid
database      |
relation      |
page          |
tuple         |
transactionid | 1998424
classid       |
objid         |
objsubid      |
transaction   | 1998424
pid           | 576
mode          | ExclusiveLock
granted       | t


The call stack on the postgres.exe process id 576:

ntdll!KiFastSystemCallRet
ntdll!NtWaitForMultipleObjects+0xc
WARNING: Stack unwind information not available. Following frames may be
wrong.
kernel32!ResetEvent+0x45
postgres!pgwin32_waitforsinglesocket+0x89
postgres!pgwin32_recv+0x82
postgres!secure_read+0x7b
postgres!TouchSocketFile+0x93
postgres!pq_getbyte+0x22
postgres!PostgresMain+0x1056
postgres!SubPostmasterMain+0x9ca
postgres!main+0x33f
postgres+0x11e7
postgres+0x1238
kernel32!IsProcessorFeaturePresent+0x9e


These are the parameters:

listen_addresses = '*'
port = 5432
max_connections = 300
shared_buffers = 30000
temp_buffers = 5000
work_mem = 4096
max_fsm_pages = 25000
max_fsm_relations = 500
vacuum_cost_delay = 50
wal_buffers = 32
checkpoint_segments = 16
effective_cache_size = 50000
random_page_cost = 3
default_statistics_target = 300
log_destination = 'stderr'
redirect_stderr = on

(Since there's 24gb RAM on this thing, I've apparently gotten the system
cache up to about 6 gb, according sysinternals.com "System Information"
applet.)

"Cache Data Map Hits %" runs at 100%
Physical Disk Queue Lengths are almost non-existent
Processors are not very busy at all

Seems like once something is in memory, we never have to go back to the
disk again except to write. (hope so with that kind of RAM).


>>
>> Carlo,
>>
>> Many things can happen at commit time. Temp tables dropped, TRUNCATEd
>> old relations unlinked, init files removed, deferred foreign key checks
>> (and subsequent cascading), dropped tables flushed. The assumption that
>> COMMIT is a short request may not be correct according to the wide range
>> of tasks that could occur according to standard SQL:2003 behaviour.
>>
>> Some of those effects take longer on larger systems. Any and all of
>> those things have potential secondary effects, all of which can also
>> conflict with other user tasks and especially with a CHECKPOINT. Then
>> there's various forms of contention caused by misconfiguration.
>>
>> I do think we need some better instrumentation for this kind of thing.
>>
>> --
>>   Simon Riggs
>>   EnterpriseDB   http://www.enterprisedb.com
>
> start here:
> http://beta.linuxports.com/pgsql-hackers-win32/2005-08/msg00051.php
>
> merlin
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
>