Обсуждение: PostgreSQL 9.0.4 blocking in lseek?

Поиск
Список
Период
Сортировка

PostgreSQL 9.0.4 blocking in lseek?

От
Sören Meyer-Eppler
Дата:
We are running into a significant performance issue with
"PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
(Debian 4.4.5-8) 4.4.5, 64-bit"
(the version string pqAdmin displays).

A fairly complex insert query on an empty destination table will run for
an indefinite amount of time (we waited literally days for the query to
complete). This does not happen every time we run the query but often.
Now ordinarily I'd assume we did something wrong with our indices or
query, but the execution plan looks sane and, more tellingly, we have:
- no CPU load
- no network traffic
- no disk I/O
- no other load on the server except this single query
and strace displaying a seemingly endless list of lseek calls.

So my assumption is that we are not running into bad Big-O() runtime
behavior but rather into some locking problem.

Any ideas what might cause this? Workarounds we could try?

thank you,

     Sören

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Gregg Jaskiewicz
Дата:
What does 'select * from pg_stat_activity' say, more precisely - the
"waiting" column.

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Sören Meyer-Eppler
Дата:
On 2011-10-27 09:42, Gregg Jaskiewicz wrote:
> What does 'select * from pg_stat_activity' say, more precisely - the
> "waiting" column.

Waiting is "false" for all rows. If I use pgadmin to lock at the server
status all locks have been granted for hours.

Although the process does in fact use CPU this time, strace still gives
a seemingly endless list of lseek calls.

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Claudio Freire
Дата:
On Thu, Oct 27, 2011 at 4:42 AM, Gregg Jaskiewicz <gryzman@gmail.com> wrote:
> What does 'select * from pg_stat_activity' say, more precisely - the
> "waiting" column.

Whether that particular process is waiting for it to be granted some
kind of database-level lock.

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Tom Lane
Дата:
=?ISO-8859-1?Q?S=F6ren_Meyer-Eppler?= <soerenme@google.com> writes:
> A fairly complex insert query on an empty destination table will run for
> an indefinite amount of time (we waited literally days for the query to
> complete). This does not happen every time we run the query but often.
> Now ordinarily I'd assume we did something wrong with our indices or
> query, but the execution plan looks sane and, more tellingly, we have:
> - no CPU load
> - no network traffic
> - no disk I/O
> - no other load on the server except this single query
> and strace displaying a seemingly endless list of lseek calls.

> So my assumption is that we are not running into bad Big-O() runtime
> behavior but rather into some locking problem.

If it were blocked on a lock, it wouldn't be doing lseeks().

The most obvious explanation for a backend that's doing repeated lseeks
and nothing else is that it's repeatedly doing seqscans on a table
that's fully cached in shared buffers.  I'd wonder about queries
embedded in often-executed plpgsql functions, for instance.  Can you
identify which table the lseeks are issued against?

(Now, having said that, I don't see how that type of theory explains no
CPU load.  But you're really going to need to provide more info before
anyone can explain it, and finding out what the lseeks are on would be
one good step.)

            regards, tom lane

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Josh Berkus
Дата:
On 10/26/11 8:47 AM, Sören Meyer-Eppler wrote:
> A fairly complex insert query on an empty destination table will run for
> an indefinite amount of time (we waited literally days for the query to
> complete). This does not happen every time we run the query but often.
> Now ordinarily I'd assume we did something wrong with our indices or
> query, but the execution plan looks sane and, more tellingly, we have:
> - no CPU load
> - no network traffic
> - no disk I/O
> - no other load on the server except this single query
> and strace displaying a seemingly endless list of lseek calls.

Hmmmm.  If you were on Solaris or OSX, I'd say you'd hit one of their
I/O bugs which can cause endless lseeks for individual disk pages.
However, I've never seen that particular pattern on Linux (other I/O
bugs, but not that one).

Questions:
(1) is it *only* that query?
(2) is there some reason you might have excessive disk fragmentation,
like running on a VM?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Sören Meyer-Eppler
Дата:
> (1) is it *only* that query?

No. There seem to be one or two others exhibiting similarly bad performance.

> (2) is there some reason you might have excessive disk fragmentation,
> like running on a VM?

No VM. The database is the only thing running on the server. Filesystem
is XFS.

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Sören Meyer-Eppler
Дата:
> embedded in often-executed plpgsql functions, for instance.  Can you
> identify which table the lseeks are issued against?

I wouldn't know how? I'm just using htop and "s" on the postgres process
to find these...

> (Now, having said that, I don't see how that type of theory explains no
> CPU load.

My bad sorry. I was relaying information from the guy administering the
server. It turns out that "no CPU load" really meant: only one of the
cores is being utilized. On a 16 core machine that looks like "no load"
but of course for the individual query still means 100%.

> But you're really going to need to provide more info before
> anyone can explain it, and finding out what the lseeks are on would be
> one good step.)

I have attached two of the offending execution plans. Anything obviously
wrong with them?

thank you for looking into it!

     Sören

Вложения

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Tom Lane
Дата:
=?ISO-8859-1?Q?S=F6ren_Meyer-Eppler?= <soerenme@google.com> writes:
>> embedded in often-executed plpgsql functions, for instance.  Can you
>> identify which table the lseeks are issued against?

> I wouldn't know how? I'm just using htop and "s" on the postgres process
> to find these...

Note the file number appearing in the lseeks, run "lsof -p PID" against
the backend process to discover the actual filename of that file, then
look for a match to the filename in pg_class.relfilenode.

> I have attached two of the offending execution plans. Anything obviously
> wrong with them?

What makes you say these are "offending execution plans"?  Both of them
seem to be completing just fine.

            regards, tom lane

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Sören Meyer-Eppler
Дата:
> Note the file number appearing in the lseeks, run "lsof -p PID" against
> the backend process to discover the actual filename of that file, then
> look for a match to the filename in pg_class.relfilenode.

Will do. I need to reproduce the error first which may take a while.

>> I have attached two of the offending execution plans. Anything obviously
>> wrong with them?
>
> What makes you say these are "offending execution plans"?  Both of them
> seem to be completing just fine.

That's exactly the point. The plan looks good, the execution times will
usually be good. But sometimes, for no immediately obvious reasons,
they'll run for hours.
I know these are the offending queries because this is what
"select now() - query_start, current_query from pg_stat_activity"
will tell me.

We execute these queries from a Java program via JDBC
(postgresql-9.1-901.jdbc4.jar). But since little data is being
transferred between the Java client and the database I hope that cannot
be the issue.

Anything else I should look out for/log during the next test run?

    Sören

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Matteo Beccati
Дата:
Hi everyone,

On 28/10/2011 02:32, Tom Lane wrote:
> =?ISO-8859-1?Q?S=F6ren_Meyer-Eppler?= <soerenme@google.com> writes:
>> A fairly complex insert query on an empty destination table will run for
>> an indefinite amount of time (we waited literally days for the query to
>> complete). This does not happen every time we run the query but often.
>> Now ordinarily I'd assume we did something wrong with our indices or
>> query, but the execution plan looks sane and, more tellingly, we have:
>> - no CPU load
>> - no network traffic
>> - no disk I/O
>> - no other load on the server except this single query
>> and strace displaying a seemingly endless list of lseek calls.
>
>> So my assumption is that we are not running into bad Big-O() runtime
>> behavior but rather into some locking problem.
>
> If it were blocked on a lock, it wouldn't be doing lseeks().
>
> The most obvious explanation for a backend that's doing repeated lseeks
> and nothing else is that it's repeatedly doing seqscans on a table
> that's fully cached in shared buffers.  I'd wonder about queries
> embedded in often-executed plpgsql functions, for instance.  Can you
> identify which table the lseeks are issued against?

I'm resuming the old thread as we've recently been hit twice by a
similar issue, both on 9.1.1 and 9.1.2. Details follow the quoted part.

The first time we've seen such behaviour was a few weeks ago, using
9.1.1. I've decided to avoid posting as I messed up with gdb while
trying to gather more information. This time, I've kept the query
running, just in case I need to provide more information.

So, here they are:

Linux (ubuntu oneiric x64) on and EC2 instance (8 virtual cores, 68GB
RAM). Storage is on EBS using xfs as filesystem. Currently using
PostgreSQL 9.1.2 manually compiled.

A rather long import process is stuck since ~12hrs on a query that
normally returns in a few seconds or minutes. Kust like to OP there's
almost no activity on the box, only a single postgres process at 100% CPU.

Two separate queries, (although similar) have triggered the issue, and
they are run one after another. Also, the import is also not always
failing, I think it happened twice in about 10 runs. We're expected to
go live in about two weeks so these were only test runs.

The explain looks like this:
http://explain.depesz.com/s/TqD

The strace output:

...
2011-12-28 09:33:59.506037500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.555167500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.604315500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.653482500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:33:59.676128500 lseek(67, 0, SEEK_END) = 65511424
2011-12-28 09:33:59.676287500 write(67,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192
...
2011-12-28 09:34:04.935944500 lseek(64, 0, SEEK_END) = 27623424
2011-12-28 09:34:04.945279500 lseek(67, 0, SEEK_END) = 65519616
2011-12-28 09:34:04.945394500 write(67,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192
...

so an lseek on fd 64 roughly every 50ms and lseek/write to fd 67 every
5s. Lsof shows:

postgres 1991 postgres   64u   REG              252,0   27623424
134250744 /data/postgresql/main/base/16387/44859
postgres 1991 postgres   67u   REG              252,0   65732608
134250602 /data/postgresql/main/base/16387/44675

with 64 being the carts table and 67 cart_products, according to
pg_class' relfilenode. All the tables involved are regular tables but
log_conversion_item, which is unlogged.

I'm also trying an EXPLAIN ANALYZE for the SELECT part, but it seems to
take a while too and is seemingly calling only gettimeofday.

Here are the custom values in postgresql.conf:

max_connections = 200
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 128MB
wal_buffers = 16MB
checkpoint_segments = 64
shared_buffers = 8GB
synchronous_commit = off
random_page_cost = 2

archive_mode = off
wal_level = hot_standby
max_wal_senders = 2
wal_keep_segments = 256

Streaming replication was accidentally left enabled during the run, but
I'm pretty sure that I didn't set it up yet when we had the previous
failure.

I'll be happy to provide more information. For now we're going to
disconnect the slave and use that for another test import run.


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Matteo Beccati
Дата:
Hi,

> I'm also trying an EXPLAIN ANALYZE for the SELECT part, but it seems to
> take a while too and is seemingly calling only gettimeofday.

An update on this. Apart from gettimeofday calls, which I filtered out
when logging, I've seen about 80 lseeks recorded every 1h10m (each 1us
apart):

2011-12-28 11:06:25.546661500 lseek(10, 0, SEEK_END) = 27623424
2011-12-28 11:06:25.546662500 lseek(10, 0, SEEK_END) = 27623424
2011-12-28 11:06:25.546663500 lseek(10, 0, SEEK_END) = 27623424
...
2011-12-28 12:16:56.144663500 lseek(10, 0, SEEK_END) = 27623424
...
2011-12-28 13:28:20.436549500 lseek(10, 0, SEEK_END) = 27623424
...

I've then decided to interrupt the EXPLAIN ANALYZE. But I suppose that
the database as it is now will likely allow to further debug what's
happening.


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Matteo Beccati
Дата:
Hi,

> A rather long import process is stuck since ~12hrs on a query that
> normally returns in a few seconds or minutes. Kust like to OP there's
> almost no activity on the box, only a single postgres process at 100% CPU.
>
> Two separate queries, (although similar) have triggered the issue, and
> they are run one after another. Also, the import is also not always
> failing, I think it happened twice in about 10 runs. We're expected to
> go live in about two weeks so these were only test runs.
>
> The explain looks like this:
> http://explain.depesz.com/s/TqD

The query eventually completed in more than 18h. For comparison a normal
run doesn't take more than 1m for that specific step.

Do you think that bad stats and suboptimal plan alone could explain such
a behaviour?


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Claudio Freire
Дата:
On Wed, Dec 28, 2011 at 3:02 PM, Matteo Beccati <php@beccati.com> wrote:
> The query eventually completed in more than 18h. For comparison a normal
> run doesn't take more than 1m for that specific step.
>
> Do you think that bad stats and suboptimal plan alone could explain such
> a behaviour?

Did you get the explain analyze output?

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Matteo Beccati
Дата:
On 28/12/2011 19:07, Claudio Freire wrote:
> On Wed, Dec 28, 2011 at 3:02 PM, Matteo Beccati <php@beccati.com> wrote:
>> The query eventually completed in more than 18h. For comparison a normal
>> run doesn't take more than 1m for that specific step.
>>
>> Do you think that bad stats and suboptimal plan alone could explain such
>> a behaviour?
>
> Did you get the explain analyze output?

Unfortunately I stopped it as I thought it wasn't going to return
anything meaningful. I've restarted the import process and it will break
right before the problematic query. Let's see if I can get any more info
tomorrow.


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

Re: PostgreSQL 9.0.4 blocking in lseek?

От
Matteo Beccati
Дата:
On 28/12/2011 19:41, Matteo Beccati wrote:
> On 28/12/2011 19:07, Claudio Freire wrote:
>> On Wed, Dec 28, 2011 at 3:02 PM, Matteo Beccati <php@beccati.com> wrote:
>>> The query eventually completed in more than 18h. For comparison a normal
>>> run doesn't take more than 1m for that specific step.
>>>
>>> Do you think that bad stats and suboptimal plan alone could explain such
>>> a behaviour?
>>
>> Did you get the explain analyze output?
>
> Unfortunately I stopped it as I thought it wasn't going to return
> anything meaningful. I've restarted the import process and it will break
> right before the problematic query. Let's see if I can get any more info
> tomorrow.

So, I'm running again the EXPLAIN ANALYZE, although I don't expect it to
return anytime soon.

However I've discovered a few typos in the index creation. If we add it
to the fact that row estimates are off for this specific query, I can
understand that the chosen plan might have been way far from optimal
with some badly picked statistics.

This is the explain analyze of the query with proper indexes in place.
As you can see estimates are still off, even though run time is ~20s:

http://explain.depesz.com/s/1UY

For comparison, here is the old explain output:

http://explain.depesz.com/s/TqD

The case is closed and as Tom pointed out already the lseek-only
activity is due to the fact that the table is fully cached in the shared
buffers and a sequential scan inside a nested loop is consistent with it.

Sorry for the noise.


Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/