Обсуждение: Odd blocking (or massively latent) issue - even with EXPLAIN

Поиск
Список
Период
Сортировка

Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
(First, apologies if this post now appears twice - it appears our mail server rewrites my address!)

Hello all. I'm a pgsql performance virgin so hope I cross all the 't's
and dot the lower-case 'j's when posting this query...

On our production database server we're experiencing behaviour that
several engineers are unable to explain - hence this Email. First, our
specs;

Scientific Linux 6.2, kernel 2.6.32
PG version 9.1.3, release 1PGDG.rhel6
24GB RAM
8 cores
2x software SSD-based RAIDs:
  a) ~660GB, RAID 5, 4 SSDs (data)
  b) ~160GB, RAID 1, 2 SSDs (xlogs + tmp tables)

We're seeing SELECT statements and even EXPLAIN (no ANAYLZE)
statements hang indefinitely until *something* (we don't know what)
releases some kind of resource or no longer becomes a massive bottle
neck. These are the symptoms.

However, the system seems healthy - no table ('heavyweight') locks are
held by any session (this happens with only a few connected sessions),
all indexes are used correctly, other transactions are writing data (we
generally only have a few sessions running at a time - perhaps 10) etc.
etc. In fact, we are writing (or bgwriter is), 2-3 hundred MB/s
sometimes.

We regularly run vacuum analyze at quiet periods - generally 1-2s daily.

These sessions (that only read data) that are blocked can block from
anything from between only 5 minutes to 10s of hours then miraculously
complete successfully at once.

Any suggestions for my next avenue of investigation? I'll try and
capture more data by observation next time it happens (it is relatively
intermittent).

Regards,

Jim

PS. These are the settings that differ from the default:

checkpoint_segments = 128
maintenance_work_mem = 256MB
synchronous_commit = off
random_page_cost = 3.0
wal_buffers = 16MB
shared_buffers = 8192MB
checkpoint_completion_target = 0.9
effective_cache_size = 18432MB
work_mem = 32MB
effective_io_concurrency = 12
max_stack_depth = 8MB
log_autovacuum_min_duration = 0
log_lock_waits = on
autovacuum_vacuum_scale_factor = 0.1
autovacuum_naptime = 8
autovacuum_max_workers = 4

PPS. I've just noticed that our memory configuration is over subscribed!
     shared_buffers + effective_cache_size > Total available RAM! Could
     this be the root cause somehow?

--
Jim Vanns
Systems Programmer
Framestore


Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
"Martin French"
Дата:
<p><font face="sans-serif" size="2">Hi</font><br /><br /><tt><font size="2">> <br />> We're seeing SELECT
statementsand even EXPLAIN (no ANAYLZE) <br />> statements hang indefinitely until *something* (we don't know
what)<br/>> releases some kind of resource or no longer becomes a massive bottle<br />> neck. These are the
symptoms.<br/></font></tt><br /><tt><font size="2">Is this in pgAdmin? Or psql on the console?</font></tt><br
/><tt><fontsize="2"><br />> However, the system seems healthy - no table ('heavyweight') locks are<br />> held by
anysession (this happens with only a few connected sessions),<br />> all indexes are used correctly, other
transactionsare writing data (we<br />> generally only have a few sessions running at a time - perhaps 10) etc.<br
/>>etc. In fact, we are writing (or bgwriter is), 2-3 hundred MB/s<br />> sometimes.<br /></font></tt><br
/><tt><fontsize="2">What is shown in "top" and "iostat" whilst the queries are running?</font></tt><br /><br
/><tt><fontsize="2">> <br />> We regularly run vacuum analyze at quiet periods - generally 1-2s daily.<br />>
<br/>> These sessions (that only read data) that are blocked can block from<br />> anything from between only 5
minutesto 10s of hours then miraculously<br />> complete successfully at once.<br />> <br /></font></tt><br
/><tt><fontsize="2">Are any "blockers" shown in pg_stat_activity?<br /></font></tt><br /><tt><font size="2">> <br
/>>checkpoint_segments = 128<br />> maintenance_work_mem = 256MB<br />> synchronous_commit = off<br />>
random_page_cost= 3.0<br />> wal_buffers = 16MB<br />> shared_buffers = 8192MB<br />>
checkpoint_completion_target= 0.9<br />> effective_cache_size = 18432MB<br />> work_mem = 32MB<br />>
effective_io_concurrency= 12<br />> max_stack_depth = 8MB<br />> log_autovacuum_min_duration = 0<br />>
log_lock_waits= on<br />> autovacuum_vacuum_scale_factor = 0.1<br />> autovacuum_naptime = 8<br />>
autovacuum_max_workers= 4<br /></font></tt><br /><tt><font size="2">Memory looks reasonably configured to me.
effective_cache_sizeis only an indication to the planner and is not actually allocated. <br /></font></tt><br
/><tt><fontsize="2">Is anything being written to the logfiles?</font></tt><br /><br /><tt><font
size="2">Cheers</font></tt><fontface="sans-serif">============================================= Romax Technology
LimitedRutherford House Nottingham Science & Technology Park Nottingham, NG7 2PZ England Telephone numbers: +44
(0)115951 88 00 (main) For other office locations see: http://www.romaxtech.com/Contact
================================================ E-mail: info@romaxtech.com Website: www.romaxtech.com
================================================= Confidentiality Statement This transmission is for the addressee only
andcontains information that is confidential and privileged. Unless you are the named addressee, or authorised to
receiveit on behalf of the addressee you may not copy or use it, or disclose it to anyone else. If you have received
thistransmission in error please delete from your system and contact the sender. Thank you for your cooperation.
=================================================</font>

Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Andrew Dunstan
Дата:
On 07/23/2012 04:41 AM, Jim Vanns wrote:
> We're seeing SELECT statements and even EXPLAIN (no ANAYLZE)
> statements hang indefinitely until *something* (we don't know what)
> releases some kind of resource or no longer becomes a massive bottle
> neck. These are the symptoms.


I have seen this sort of behaviour on systems with massive catalogs
(millions of tables and indexes). Could that be your problem?


cheers

andrew

Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
Thank you all for your replies, I shall try and qualify and confirm...

On Mon, 2012-07-23 at 14:46 +0100, Martin French wrote:
> Hi
>
> >
> > We're seeing SELECT statements and even EXPLAIN (no ANAYLZE)
> > statements hang indefinitely until *something* (we don't know what)
> > releases some kind of resource or no longer becomes a massive bottle
> > neck. These are the symptoms.
>
> Is this in pgAdmin? Or psql on the console?
>
psql

> > However, the system seems healthy - no table ('heavyweight') locks
> are
> > held by any session (this happens with only a few connected
> sessions),
> > all indexes are used correctly, other transactions are writing data
> (we
> > generally only have a few sessions running at a time - perhaps 10)
> etc.
> > etc. In fact, we are writing (or bgwriter is), 2-3 hundred MB/s
> > sometimes.
>
> What is shown in "top" and "iostat" whilst the queries are running?

Generally, lots of CPU churn (90-100%) and a fair bit of I/O wait.
iostat reports massive reads (up to 300MB/s).

> >
> > We regularly run vacuum analyze at quiet periods - generally 1-2s
> daily.

(this is to answer to someone who didn't reply to the list)

We run full scans using vacuumdb so don't just rely on autovacuum. The
small table is so small (<50 tuples) a sequence scan is always
performed.

> > These sessions (that only read data) that are blocked can block from
> > anything from between only 5 minutes to 10s of hours then
> miraculously
> > complete successfully at once.
> >
>
> Are any "blockers" shown in pg_stat_activity?

None. Ever. Nothing in pg_locks either.

> >
> > checkpoint_segments = 128
> > maintenance_work_mem = 256MB
> > synchronous_commit = off
> > random_page_cost = 3.0
> > wal_buffers = 16MB
> > shared_buffers = 8192MB
> > checkpoint_completion_target = 0.9
> > effective_cache_size = 18432MB
> > work_mem = 32MB
> > effective_io_concurrency = 12
> > max_stack_depth = 8MB
> > log_autovacuum_min_duration = 0
> > log_lock_waits = on
> > autovacuum_vacuum_scale_factor = 0.1
> > autovacuum_naptime = 8
> > autovacuum_max_workers = 4
>
> Memory looks reasonably configured to me. effective_cache_size is only
> an indication to the planner and is not actually allocated.

I realise that.

> Is anything being written to the logfiles?

Nothing obvious - and we log a fair amount. No tmp table creations,
no locks held.

To add to this EXPLAIN reports it took only 0.23ms to run (for example)
whereas the wall clock time is more like 20-30 minutes (or up to n hours
as I said where everything appears to click back into place at the same
time).

Thanks.

Jim

> Cheers============================================= Romax Technology
> Limited Rutherford House Nottingham Science & Technology Park
> Nottingham, NG7 2PZ England Telephone numbers: +44 (0)115 951 88 00
> (main) For other office locations see:
> http://www.romaxtech.com/Contact =================================
> =============== E-mail: info@romaxtech.com Website: www.romaxtech.com
> ================================= ================ Confidentiality
> Statement This transmission is for the addressee only and contains
> information that is confidential and privileged. Unless you are the
> named addressee, or authorised to receive it on behalf of the
> addressee you may not copy or use it, or disclose it to anyone else.
> If you have received this transmission in error please delete from
> your system and contact the sender. Thank you for your cooperation.
> =================================================
>

--
Jim Vanns
Systems Programmer
Framestore


Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
On Mon, 2012-07-23 at 09:53 -0400, Andrew Dunstan wrote:
> On 07/23/2012 04:41 AM, Jim Vanns wrote:
> > We're seeing SELECT statements and even EXPLAIN (no ANAYLZE)
> > statements hang indefinitely until *something* (we don't know what)
> > releases some kind of resource or no longer becomes a massive bottle
> > neck. These are the symptoms.
>
> I have seen this sort of behaviour on systems with massive catalogs
> (millions of tables and indexes). Could that be your problem?

Possibly. I'm not familiar with the catalogs. I'll look into that.

Thanks,

Jim

>
> cheers
>
> andrew
>

--
Jim Vanns
Systems Programmer
Framestore


Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Tom Lane
Дата:
Jim Vanns <james.vanns@framestore.com> writes:
> We're seeing SELECT statements and even EXPLAIN (no ANAYLZE)
> statements hang indefinitely until *something* (we don't know what)
> releases some kind of resource or no longer becomes a massive bottle
> neck. These are the symptoms.

Does anything show up as blocked in the pg_locks view?

Could you attach to the stuck process with gdb and get a stack trace?

            regards, tom lane

Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
On Mon, 2012-07-23 at 11:09 -0400, Tom Lane wrote:
> Jim Vanns <james.vanns@framestore.com> writes:
> > We're seeing SELECT statements and even EXPLAIN (no ANAYLZE)
> > statements hang indefinitely until *something* (we don't know what)
> > releases some kind of resource or no longer becomes a massive bottle
> > neck. These are the symptoms.
>
> Does anything show up as blocked in the pg_locks view?

Nope.

> Could you attach to the stuck process with gdb and get a stack trace?

Haven't been quite brave enough to do that yet - this is a production
server. I did manage to strace a process though - it (the server side
process of a psql EXPLAIN) appeared to spin on an awful lot of semop()
calls with the occasional read(). Of course, in the context of a shared
memory system such as postgres I'd expect to see quite a lot of semop()
calls but I've no idea how much is normal and how much is excessive.

Jim

>             regards, tom lane
>

--
Jim Vanns
Systems Programmer
Framestore


Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Craig Ringer
Дата:
On 07/23/2012 10:46 PM, Jim Vanns wrote:
> Nothing obvious - and we log a fair amount. No tmp table creations,
> no locks held.
>
> To add to this EXPLAIN reports it took only 0.23ms to run (for example)
> whereas the wall clock time is more like 20-30 minutes (or up to n hours
> as I said where everything appears to click back into place at the same
> time).
How many concurrent connections do you have?

--
Craig Ringer

Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
"Martin French"
Дата:
<p><tt><font size="2">> > Hi<br />> > <br />> > > <br />> > > We're seeing SELECT
statementsand even EXPLAIN (no ANAYLZE) <br />> > > statements hang indefinitely until *something* (we don't
knowwhat)<br />> > > releases some kind of resource or no longer becomes a massive bottle<br />> > >
neck.These are the symptoms.<br />> > <br />> > Is this in pgAdmin? Or psql on the console?<br />> >
<br/>> psql<br />> <br />> > > However, the system seems healthy - no table ('heavyweight') locks<br
/>>> are<br />> > > held by any session (this happens with only a few connected<br />> >
sessions),<br/>> > > all indexes are used correctly, other transactions are writing data<br />> > (we<br
/>>> > generally only have a few sessions running at a time - perhaps 10)<br />> > etc.<br />> >
>etc. In fact, we are writing (or bgwriter is), 2-3 hundred MB/s<br />> > > sometimes.<br />> > <br
/>>> What is shown in "top" and "iostat" whilst the queries are running?<br />> <br />> Generally, lots of
CPUchurn (90-100%) and a fair bit of I/O wait.<br />> iostat reports massive reads (up to 300MB/s).<br
/></font></tt><br/><tt><font size="2">This looks like this is a pure IO issue. You mentioned that this was a software
RAIDsystem. I wonder if there's some complication there.</font></tt><br /><br /><tt><font size="2">Have you tried
settingthe disk queues to deadline?</font></tt><br /><br /><tt><font size="2">echo "deadline" >
/sys/block/{DEVICE-NAME}/queue/scheduler</font></tt><br/><br /><tt><font size="2">That might help. But to be honest, it
reallydoes sound disk/software raid related with the CPU and IO being so high.</font></tt><br /><br /><tt><font
size="2">Canyou attempt to replicate the problem on another system without software RAID?</font></tt><br /><br
/><tt><fontsize="2">Also, you might want to try a disk test on the machine, it's 24GB ram right?</font></tt><br /><br
/><tt><fontsize="2">so, try the following tests on the Postgres data disk (you'll obviously need lots of space for
this):</font></tt><br/><br /><br /><tt><font size="2">Write Test: </font></tt><br /><tt><font size="2"> time sh -c "dd
if=/dev/zeroof=bigfile bs=8k count=6000000 && sync"</font></tt><br /><br /><tt><font size="2">Read
Test:</font></tt><br/><tt><font size="2"> time dd if=bigfile of=/dev/null bs=8k</font></tt><br /><br /><tt><font
size="2">(Tests taken from Greg Smiths page: </font></tt><a
href="http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm"><fontcolor="#0000FF" face="serif"
size="3"><u>http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm</u></font></a><fontface="serif"
size="3"> </font><tt><fontsize="2">)</font></tt><br /><br /><tt><font size="2">> <br />> > > <br />>
>> We regularly run vacuum analyze at quiet periods - generally 1-2s<br />> > daily.<br />> <br />>
(thisis to answer to someone who didn't reply to the list)<br />> <br />> We run full scans using vacuumdb so
don'tjust rely on autovacuum. The<br />> small table is so small (<50 tuples) a sequence scan is always<br />>
performed.<br/>> <br />> > > These sessions (that only read data) that are blocked can block from<br />>
>> anything from between only 5 minutes to 10s of hours then<br />> > miraculously<br />> > >
completesuccessfully at once.<br />> > > <br />> > <br />> > Are any "blockers" shown in
pg_stat_activity?<br/>> <br />> None. Ever. Nothing in pg_locks either.<br />> <br />> > > <br />>
>> checkpoint_segments = 128<br />> > > maintenance_work_mem = 256MB<br />> > >
synchronous_commit= off<br />> > > random_page_cost = 3.0<br />> > > wal_buffers = 16MB<br />>
>> shared_buffers = 8192MB<br />> > > checkpoint_completion_target = 0.9<br />> > >
effective_cache_size= 18432MB<br />> > > work_mem = 32MB<br />> > > effective_io_concurrency = 12<br
/>>> > max_stack_depth = 8MB<br />> > > log_autovacuum_min_duration = 0<br />> > >
log_lock_waits= on<br />> > > autovacuum_vacuum_scale_factor = 0.1<br />> > > autovacuum_naptime =
8<br/>> > > autovacuum_max_workers = 4<br />> > <br />> > Memory looks reasonably configured to
me.effective_cache_size is only<br />> > an indication to the planner and is not actually allocated. <br />>
<br/>> I realise that.<br />> <br />> > Is anything being written to the logfiles?<br />> <br />>
Nothingobvious - and we log a fair amount. No tmp table creations,<br />> no locks held. <br />> <br />> To
addto this EXPLAIN reports it took only 0.23ms to run (for example)<br />> whereas the wall clock time is more like
20-30minutes (or up to n hours<br />> as I said where everything appears to click back into place at the same<br
/>>time).<br />> <br />> Thanks.<br />> <br /></font></tt><br /><tt><font size="2">Something else you might
wantto try is running with a default Postgresql.conf, if the query/explain then runs fine, then that would lead me to
believethat there is a configuration issue. Although I'm pretty convinced that it may be the disk set up.
</font></tt><br/><br /><tt><font size="2">Cheers<br /></font></tt><font
face="sans-serif">=============================================Romax Technology Limited Rutherford House Nottingham
Science& Technology Park Nottingham, NG7 2PZ England Telephone numbers: +44 (0)115 951 88 00 (main) For other
officelocations see: http://www.romaxtech.com/Contact ================================= =============== E-mail:
info@romaxtech.comWebsite: www.romaxtech.com ================================= ================ Confidentiality
StatementThis transmission is for the addressee only and contains information that is confidential and privileged.
Unlessyou are the named addressee, or authorised to receive it on behalf of the addressee you may not copy or use it,
ordisclose it to anyone else. If you have received this transmission in error please delete from your system and
contactthe sender. Thank you for your cooperation. =================================================</font> 

Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
On Tue, 2012-07-24 at 08:30 +0800, Craig Ringer wrote:
> On 07/23/2012 10:46 PM, Jim Vanns wrote:
> > Nothing obvious - and we log a fair amount. No tmp table creations,
> > no locks held.
> >
> > To add to this EXPLAIN reports it took only 0.23ms to run (for example)
> > whereas the wall clock time is more like 20-30 minutes (or up to n hours
> > as I said where everything appears to click back into place at the same
> > time).
> How many concurrent connections do you have?

Between 4 and 64 at peak! max_connections is only set to 100.

Jim

> --
> Craig Ringer

--
Jim Vanns
Systems Programmer
Framestore


Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
> This looks like this is a pure IO issue. You mentioned that this was a
> software RAID system. I wonder if there's some complication there.
>
> Have you tried setting the disk queues to deadline?
>
> echo "deadline" > /sys/block/{DEVICE-NAME}/queue/scheduler
> That might help. But to be honest, it really does sound disk/software
> raid related with the CPU and IO being so high.
>
> Can you attempt to replicate the problem on another system without
> software RAID?
>
> Also, you might want to try a disk test on the machine, it's 24GB ram
> right?
>
> so, try the following tests on the Postgres data disk (you'll
> obviously need lots of space for this):
>
> Write Test:
> time sh -c "dd if=/dev/zero of=bigfile bs=8k count=6000000 && sync"
>
> Read Test:
> time dd if=bigfile of=/dev/null bs=8k

I've already tried something very similar using dd. No performance
penalties during a normal running of the system - or when this blocking
happens either actually. But I agree, it does indeed sound like some
sort of I/O problem. I just don't know what! I do have a few more tricks
up my sleeve that I'll try today. I'll post any results that I have.

That latter test - won't that pretty much just read from the page cache?
'sync' may well have forced dirty pages to disk but does it actually
evict them to?

Anyway, that is off topic... perhaps ;)

Thanks again,

Jim

> ( Tests taken from Greg Smiths page:
> http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm )
>
> >
> > > >
> > > > We regularly run vacuum analyze at quiet periods - generally
> 1-2s
> > > daily.
> >
> > (this is to answer to someone who didn't reply to the list)
> >
> > We run full scans using vacuumdb so don't just rely on autovacuum.
> The
> > small table is so small (<50 tuples) a sequence scan is always
> > performed.
> >
> > > > These sessions (that only read data) that are blocked can block
> from
> > > > anything from between only 5 minutes to 10s of hours then
> > > miraculously
> > > > complete successfully at once.
> > > >
> > >
> > > Are any "blockers" shown in pg_stat_activity?
> >
> > None. Ever. Nothing in pg_locks either.
> >
> > > >
> > > > checkpoint_segments = 128
> > > > maintenance_work_mem = 256MB
> > > > synchronous_commit = off
> > > > random_page_cost = 3.0
> > > > wal_buffers = 16MB
> > > > shared_buffers = 8192MB
> > > > checkpoint_completion_target = 0.9
> > > > effective_cache_size = 18432MB
> > > > work_mem = 32MB
> > > > effective_io_concurrency = 12
> > > > max_stack_depth = 8MB
> > > > log_autovacuum_min_duration = 0
> > > > log_lock_waits = on
> > > > autovacuum_vacuum_scale_factor = 0.1
> > > > autovacuum_naptime = 8
> > > > autovacuum_max_workers = 4
> > >
> > > Memory looks reasonably configured to me. effective_cache_size is
> only
> > > an indication to the planner and is not actually allocated.
> >
> > I realise that.
> >
> > > Is anything being written to the logfiles?
> >
> > Nothing obvious - and we log a fair amount. No tmp table creations,
> > no locks held.
> >
> > To add to this EXPLAIN reports it took only 0.23ms to run (for
> example)
> > whereas the wall clock time is more like 20-30 minutes (or up to n
> hours
> > as I said where everything appears to click back into place at the
> same
> > time).
> >
> > Thanks.
> >
>
> Something else you might want to try is running with a default
> Postgresql.conf, if the query/explain then runs fine, then that would
> lead me to believe that there is a configuration issue. Although I'm
> pretty convinced that it may be the disk set up.
>
> Cheers
> ============================================= Romax Technology Limited
> Rutherford House Nottingham Science & Technology Park Nottingham, NG7
> 2PZ England Telephone numbers: +44 (0)115 951 88 00 (main) For other
> office locations see: http://www.romaxtech.com/Contact
> ================================= =============== E-mail:
> info@romaxtech.com Website: www.romaxtech.com
> ================================= ================ Confidentiality
> Statement This transmission is for the addressee only and contains
> information that is confidential and privileged. Unless you are the
> named addressee, or authorised to receive it on behalf of the
> addressee you may not copy or use it, or disclose it to anyone else.
> If you have received this transmission in error please delete from
> your system and contact the sender. Thank you for your cooperation.
> =================================================
>

--
Jim Vanns
Systems Programmer
Framestore


Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
"Martin French"
Дата:
<p><tt><font size="2">Hi Jim,</font></tt><br /><br /><tt><font size="2">> <br />> I've already tried something
verysimilar using dd. No performance<br />> penalties during a normal running of the system - or when this
blocking<br/>> happens either actually. But I agree, it does indeed sound like some<br />> sort of I/O problem. I
justdon't know what! I do have a few more tricks<br />> up my sleeve that I'll try today. I'll post any results that
Ihave.<br /></font></tt><br /><tt><font size="2">Ok, let me know how you get on. </font></tt><br /><br /><tt><font
size="2">><br />> That latter test - won't that pretty much just read from the page cache?<br />> 'sync' may
wellhave forced dirty pages to disk but does it actually<br />> evict them to?<br /></font></tt><br /><tt><font
size="2">Basically,the cache is avoided because of the size of the file. 6000000 blocks at 8k exceeds the size of RAM
inthe machine, so it *should* miss the cache and hit the disk directly. :)</font></tt><br /><br /><tt><font
size="2">><br />> Anyway, that is off topic... perhaps ;)<br />> <br />> Thanks again,<br />> <br />>
Jim<br/>> <br /></font></tt><br /><tt><font size="2">Cheers</font></tt><br /><br /><tt><font size="2">Martin
</font></tt><fontface="sans-serif">============================================= Romax Technology Limited Rutherford
HouseNottingham Science & Technology Park Nottingham, NG7 2PZ England Telephone numbers: +44 (0)115 951 88 00
(main)For other office locations see: http://www.romaxtech.com/Contact =================================
===============E-mail: info@romaxtech.com Website: www.romaxtech.com ================================= ================
ConfidentialityStatement This transmission is for the addressee only and contains information that is confidential and
privileged.Unless you are the named addressee, or authorised to receive it on behalf of the addressee you may not copy
oruse it, or disclose it to anyone else. If you have received this transmission in error please delete from your system
andcontact the sender. Thank you for your cooperation. =================================================</font> 

Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
> > That latter test - won't that pretty much just read from the page
> cache?
> > 'sync' may well have forced dirty pages to disk but does it actually
> > evict them to?
>
> Basically, the cache is avoided because of the size of the file.
> 6000000 blocks at 8k exceeds the size of RAM in the machine, so it
> *should* miss the cache and hit the disk directly. :)

Doh!

> >
> > Anyway, that is off topic... perhaps ;)
> >
> > Thanks again,
> >
> > Jim
> >
>
> Cheers
>
> Martin ============================================= Romax Technology
> Limited Rutherford House Nottingham Science & Technology Park
> Nottingham, NG7 2PZ England Telephone numbers: +44 (0)115 951 88 00
> (main) For other office locations see:
> http://www.romaxtech.com/Contact =================================
> =============== E-mail: info@romaxtech.com Website: www.romaxtech.com
> ================================= ================ Confidentiality
> Statement This transmission is for the addressee only and contains
> information that is confidential and privileged. Unless you are the
> named addressee, or authorised to receive it on behalf of the
> addressee you may not copy or use it, or disclose it to anyone else.
> If you have received this transmission in error please delete from
> your system and contact the sender. Thank you for your cooperation.
> =================================================
>

--
Jim Vanns
Systems Programmer
Framestore


Re: Odd blocking (or massively latent) issue - even with EXPLAIN

От
Jim Vanns
Дата:
Hello again. So sorry for resurrecting such an old thread but the
problem still persists - I've just had very little to report on, until
now...

> > That latter test - won't that pretty much just read from the page
> cache?
> > 'sync' may well have forced dirty pages to disk but does it actually
> > evict them to?
>
> Basically, the cache is avoided because of the size of the file.
> 6000000 blocks at 8k exceeds the size of RAM in the machine, so it
> *should* miss the cache and hit the disk directly. :)

OK, I did this during the problematic time and write speeds (sustained)
are in the order of 250MB/s :) It took just 3m21s to write all ~50GB. We
get read speeds of a wonderfully massive ~530MB/s - the whole file read
in just 1m30s. All metrics gathered with iostat -p <devices> -m 1.

Now, what I have noticed is this; we run two databases on this one
machine. One (DB) is completely operable in a normal way during the slow
period and the other is not. This (unresponsive) database has just two
client processes connected - one is writing, the other is (meant to be)
reading.

Neither registers in pg_locks - one does not block the other at the DB
level. However the write process (INSERTs) is writing between 5 and 10
MB/s. The read process (a SELECT or EXPLAIN) just spins the CPU at 100%
and register 0.0 MB/s - yet it should be reading *a lot* of data.

So does PostgreSQL somehow (or have I misconfigured it to) always
prioritise writes over reads?

I'm still at a loss!

Any further pointers would be appreciated.

Jim

PS. This is with the deadline scheduler and with each block device set
with --setra 8192.

> > Anyway, that is off topic... perhaps ;)
> >
> > Thanks again,
> >
> > Jim
> >
>
> Cheers
>
> Martin ============================================= Romax Technology
> Limited Rutherford House Nottingham Science & Technology Park
> Nottingham, NG7 2PZ England Telephone numbers: +44 (0)115 951 88 00
> (main) For other office locations see:
> http://www.romaxtech.com/Contact =================================
> =============== E-mail: info@romaxtech.com Website: www.romaxtech.com
> ================================= ================ Confidentiality
> Statement This transmission is for the addressee only and contains
> information that is confidential and privileged. Unless you are the
> named addressee, or authorised to receive it on behalf of the
> addressee you may not copy or use it, or disclose it to anyone else.
> If you have received this transmission in error please delete from
> your system and contact the sender. Thank you for your cooperation.
> =================================================
>

--
Jim Vanns
Systems Programmer
Framestore