Thread: upgrade from 9.2.x to 9.3 causes significant performance degradation

upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Lonni J Friedman
Date:
Greetings,
I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
performance degradation.  PostgreSQL simply feels slower.  Nothing
other than the version of PostgreSQL changed yesterday.  I used
pg_upgrade to perform the upgrade, and ran the generated
analyze_new_cluster.sh immediately afterwards, which completed
successfully.

Prior to the upgrade, I'd generally expect a load average of less than
2.00 on the master, and less than 1.00 on each of the slaves.  Since
the upgrade, the load average on the master has been in double digits
(hitting 100.00 for a few minutes), and the slaves are consistently
above 5.00.

There are a few things that are jumping out at me as behaving
differently since the upgrade.  The vmstat count of processes waiting
for runtime has increased dramatically.  Prior to the upgrade the
count was consistently less than 10; since upgrading it hovers
between 40 & 60 at all times.  The /proc/interrupts "Local timer
interrupts" count has increased dramatically as well.  It used to
hover around 6000 and is now over 20k much of the time.  However, I'm
starting to suspect that they are both symptoms of the problem rather
than the cause.
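
(For reference, the observations above came from roughly the following
commands; the exact invocations are from memory, so treat them as
approximate:)

vmstat 1        # the "r" column counts processes waiting for runtime
grep 'Local timer interrupts' /proc/interrupts    # the LOC row on RHEL6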

At this point, I'm looking for guidance on how to debug this problem
more effectively.

thanks


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Eduardo Morras
Date:
On Tue, 17 Sep 2013 09:19:29 -0700
Lonni J Friedman <netllama@gmail.com> wrote:

> Greetings,
> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
> replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
> performance degradation.  PostgreSQL simply feels slower.  Nothing
> other than the version of PostgreSQL changed yesterday.  I used
> pg_upgrade to perform the upgrade, and ran the generated
> analyze_new_cluster.sh immediately afterwards, which completed
> successfully.
>
> Prior to the upgrade, I'd generally expect a load average of less than
> 2.00 on the master, and less than 1.00 on each of the slaves.  Since
> the upgrade, the load average on the master has been in double digits
> (hitting 100.00 for a few minutes), and the slaves are consistently
> above 5.00.
>
> There are a few things that are jumping out at me as behaving
> differently since the upgrade.  The vmstat count of processes waiting
> for runtime has increased dramatically.  Prior to the upgrade the
> count was consistently less than 10; since upgrading it hovers
> between 40 & 60 at all times.  The /proc/interrupts "Local timer
> interrupts" count has increased dramatically as well.  It used to
> hover around 6000 and is now over 20k much of the time.  However, I'm
> starting to suspect that they are both symptoms of the problem rather
> than the cause.
>
> At this point, I'm looking for guidance on how to debug this problem
> more effectively.

I don't know what's happening, but:

a) Does analyze_new_cluster.sh include a reindex? If not, the indexes may be useless, because the analyze statistics say so.
b) Did you configure postgresql.conf on 9.3.0 for your server/load? Perhaps it still has the default install values.
c) What do the logs say?
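
(Regarding (a): if a rebuild does turn out to be needed, a minimal
sketch would be something like this; both tools ship with the server
packages:)

reindexdb --all              # rebuild every index in every database
vacuumdb --all --analyze     # then refresh the planner statistics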

> thanks


---   ---
Eduardo Morras <emorrasg@yahoo.es>


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Lonni J Friedman
Date:
On Tue, Sep 17, 2013 at 9:54 AM, Eduardo Morras <emorrasg@yahoo.es> wrote:
> On Tue, 17 Sep 2013 09:19:29 -0700
> Lonni J Friedman <netllama@gmail.com> wrote:
>
>> Greetings,
>> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
>> replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
>> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
>> performance degradation.  PostgreSQL simply feels slower.  Nothing
>> other than the version of PostgreSQL changed yesterday.  I used
>> pg_upgrade to perform the upgrade, and ran the generated
>> analyze_new_cluster.sh immediately afterwards, which completed
>> successfully.
>>
>> Prior to the upgrade, I'd generally expect a load average of less than
>> 2.00 on the master, and less than 1.00 on each of the slaves.  Since
>> the upgrade, the load average on the master has been in double digits
>> (hitting 100.00 for a few minutes), and the slaves are consistently
>> above 5.00.
>>
>> There are a few things that are jumping out at me as behaving
>> differently since the upgrade.  The vmstat count of processes waiting
>> for runtime has increased dramatically.  Prior to the upgrade the
>> count was consistently less than 10; since upgrading it hovers
>> between 40 & 60 at all times.  The /proc/interrupts "Local timer
>> interrupts" count has increased dramatically as well.  It used to
>> hover around 6000 and is now over 20k much of the time.  However, I'm
>> starting to suspect that they are both symptoms of the problem rather
>> than the cause.
>>
>> At this point, I'm looking for guidance on how to debug this problem
>> more effectively.
>
> I don't know what's happening, but:
>
> a) Does analyze_new_cluster.sh include a reindex? If not, the indexes may be useless, because the analyze statistics say so.

No, it doesn't include a reindex.  It merely invokes "vacuumdb --all
--analyze-only" several times, with PGOPTIONS supplying different
values of default_statistics_target along with "-c vacuum_cost_delay=0".
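
(For reference, the generated script boils down to something like the
following; this is a sketch of the stock 9.3 pg_upgrade output, so the
exact values are assumptions:)

# three vacuumdb passes with increasingly accurate planner statistics
PGOPTIONS="-c default_statistics_target=1 -c vacuum_cost_delay=0" \
    vacuumdb --all --analyze-only
PGOPTIONS="-c default_statistics_target=10 -c vacuum_cost_delay=0" \
    vacuumdb --all --analyze-only
vacuumdb --all --analyze-only    # final pass with the configured target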

According to the documentation for pg_upgrade, post-upgrade scripts to
rebuild tables and indexes are generated automatically when needed.
Nothing was generated for this purpose, at least not in any obvious
place.  The analyze_new_cluster.sh script is the only one that was
automatically generated, as far as I can tell.

> b) Did you configure postgresql.conf on 9.3.0 for your server/load? Perhaps it still has the default install values.

Yes, I'm using the same postgresql.conf as I was using when running
9.2.4.  It's definitely not running with default install values.

> c) What do the logs say?

The postgres server logs look perfectly normal, apart from non-trivially
slower run times for most queries.  There's nothing unusual in any of
the OS level logs (/var/log/messages, etc) or dmesg.

Do you have any other suggestions?


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Jeff Janes
Date:
On Tue, Sep 17, 2013 at 11:22 AM, Lonni J Friedman <netllama@gmail.com> wrote:

>> c) What do the logs say?
>
> The postgres server logs look perfectly normal, apart from non-trivially
> slower run times for most queries.  There's nothing unusual in any of
> the OS level logs (/var/log/messages, etc) or dmesg.

Are you generally CPU limited or IO limited?

Grab one of those slower queries and do an "explain (analyze, buffers)" of it.  Preferably with track_io_timing on.

Do you still have a test/dev/QA/etc. copy running on 9.2 for comparison?

Cheers,

Jeff

Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Lonni J Friedman
Date:
Thanks for your reply.  Comments/answers inline below....

On Tue, Sep 17, 2013 at 11:28 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Sep 17, 2013 at 11:22 AM, Lonni J Friedman <netllama@gmail.com>
> wrote:
>>
>>
>>> c) What do the logs say?
>>
>> The postgres server logs look perfectly normal, apart from non-trivially
>> slower run times for most queries.  There's nothing unusual in any of
>> the OS level logs (/var/log/messages, etc) or dmesg.
>
>
> Are you generally CPU limited or IO limited?

top shows over 90% of the load is in sys space.  vmstat output seems
to suggest that it's CPU bound (or bouncing back & forth):
########
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
  1  0  17308 852016 141104 127074192    0    0    10    18    0    0  6  4 90  0  0
  0  0  17308 872316 141104 127074200    0    0     0   988  940  564  1  0 99  0  0
  0  0  17308 884288 141104 127074208    0    0     0  1921 1202 2132  1  0 99  0  0
  0  0  17308 898728 141104 127074208    0    0     0     0 1064  577  1  0 99  0  0
  2  0  17308 914920 141104 127074224    0    0     0    44  820  427  1  0 99  0  0
  0  0  17308 926524 141104 127074272    0    0     0    48 1173  585  1  0 99  0  0
108  1  17308 753648 141104 127074224    0    0     0   236 9825 3901 12  5 83  0  0
 50  0  17308 723156 141104 127074400    0    0     0   144 43481 9105 20 79  1  0  0
 45  0  17308 722860 141104 127074416    0    0     0     8 32969 1998  1 97  2  0  0
 47  0  17308 738996 141104 127074416    0    0     0     0 34099 1739  1 99  0  0  0
101  0  17308 770220 141104 127074480    0    0     0    32 38550 5998  7 93  0  0  0
101  0  17308 775732 141104 127074512    0    0     0   156 33889 5809  4 96  0  0  0
 99  0  17308 791232 141104 127074544    0    0     0     0 32385 4981  0 100  0  0  0
 96  0  17308 803156 141104 127074544    0    0     0    24 32413 4824  0 100  0  0  0
 87  0  17308 811624 141104 127074544    0    0     0     0 32438 4470  0 100  0  0  0
 83  0  17308 815500 141104 127074544    0    0     0     0 32489 4159  0 100  0  0  0
 80  0  17308 826572 141104 127074552    0    0     0    33 32582 3948  0 100  0  0  0
 73  0  17308 853264 141108 127074552    0    0     0    52 32833 3840  0 100  0  0  0
 73  0  17308 882240 141108 127074560    0    0     0     4 32820 3594  0 100  0  0  0
 72  0  17308 892256 141108 127074560    0    0     0     0 32368 3516  0 100  0  0  0
#######

iostat consistently shows %util under 1.00, which also suggests that
disk IO is not the bottleneck:
#########
iostat -dx /dev/sdb 5
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)     09/17/2013     _x86_64_    (32 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.02     0.21    6.91   31.33   651.60  1121.85    46.38     0.09    2.25   0.08   0.31

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    8.00     0.00    93.00    11.62     0.00    0.28   0.20   0.16

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00   11.00     0.00   125.40    11.40     0.00    0.16   0.16   0.18

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00  105.00     0.00  3380.40    32.19     0.29    2.76   0.03   0.34

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00   14.80     0.00  2430.60   164.23     0.00    0.12   0.09   0.14

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     1.20    0.00   41.60     0.00  1819.40    43.74     0.02    0.45   0.05   0.20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    2.80     0.00    32.00    11.43     0.00    0.00   0.00   0.00
#########

mpstat also shows virtually zero iowait, with a ton of sys (CPU) time:
#########
mpstat 2 10
Linux 2.6.32-358.6.2.el6.x86_64 (cuda-db7)     09/17/2013     _x86_64_    (32 CPU)

12:53:19 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
12:53:21 PM  all    7.36    0.00   92.58    0.00    0.00    0.03    0.00    0.00    0.03
12:53:23 PM  all    6.35    0.00   90.43    0.00    0.00    0.03    0.00    0.00    3.19
12:53:25 PM  all    3.13    0.00   68.20    0.00    0.00    0.02    0.00    0.00   28.66
12:53:27 PM  all    6.07    0.00   68.46    0.00    0.00    0.03    0.00    0.00   25.44
12:53:29 PM  all    5.83    0.00   94.14    0.00    0.00    0.03    0.00    0.00    0.00
12:53:31 PM  all    5.75    0.00   94.14    0.00    0.00    0.11    0.00    0.00    0.00
12:53:33 PM  all    7.65    0.00   40.32    0.00    0.00    0.03    0.00    0.00   52.00
12:53:35 PM  all    3.78    0.00    0.94    0.00    0.00    0.03    0.00    0.00   95.25
12:53:37 PM  all   16.09    0.00   14.33    0.02    0.00    0.05    0.00    0.00   69.51
12:53:39 PM  all    5.60    0.00   10.38    0.00    0.00    0.05    0.00    0.00   83.97
Average:     all    6.76    0.00   57.41    0.00    0.00    0.04    0.00    0.00   35.79
#########

So unless I've misunderstood the output above, all of this suggests
that the CPU is the bottleneck.  If you'd like some other data, let me
know and I'll be happy to capture it.


> Grab one of those slower queries and do an "explain (analyze, buffers)" of
> it.  Preferably with track_io_timing on.

Just to be clear, just about every query is slow, not just certain
queries.  The entire server is slow, and the only thing that has
changed is the version of postgres (granted, not a trivial change).
Anyway, I picked a random query, connected with psql and ran the
following:
########
nightly=# SET track_io_timing=on ;
SET

nightly=# EXPLAIN (analyze,buffers) SELECT count(*) FROM pg_catalog.pg_class AS c
nightly-#   WHERE c.oid = pgpool_regclass('dblcufft_testlist') AND c.relpersistence = 'u';
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2779.71..2779.72 rows=1 width=0) (actual time=83.136..83.136 rows=1 loops=1)
   Buffers: shared hit=1633
   ->  Seq Scan on pg_class c  (cost=0.00..2779.70 rows=1 width=0) (actual time=83.131..83.131 rows=0 loops=1)
         Filter: ((relpersistence = 'u'::"char") AND (oid = pgpool_regclass('dblcufft_testlist'::cstring)))
         Rows Removed by Filter: 70393
         Buffers: shared hit=1633
 Total runtime: 83.181 ms
(7 rows)
########

I'm not quite sure that I understand how to interpret this output.  I
didn't have track_io_timing enabled by default.  Does it require a
restart to enable?  The official docs seem to suggest that there
should be a number of additional fields, which I'm not seeing.
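
(For what it's worth, a minimal check, assuming superuser access:
track_io_timing can be set per session without a restart, and the
extra "I/O Timings" fields only appear when the plan actually reads or
writes blocks, so a plan served entirely from shared buffers, like the
one above, prints none:)

psql nightly <<'SQL'
SET track_io_timing = on;    -- per-session superuser setting; no restart needed
SHOW track_io_timing;
-- "I/O Timings: read=..." lines appear only for nodes that touch disk;
-- the plan above was all "shared hit", hence no I/O timing fields
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM pg_catalog.pg_class;
SQL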

>
> Do you still have test/dev/QA/etc copy running on 9.2 for comparison?

Not live.  I'd have to restore from a backup, and my test environment
is not the same class of hardware as production.  I could get that
process underway if you wanted.  What did you have in mind?


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Andres Freund
Date:
Hi,

On 2013-09-17 09:19:29 -0700, Lonni J Friedman wrote:
> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
> replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
> performance degradation.  PostgreSQL simply feels slower.  Nothing
> other than the version of PostgreSQL changed yesterday.  I used
> pg_upgrade to perform the upgrade, and ran the generated
> analyze_new_cluster.sh immediately afterwards, which completed
> successfully.

Where did you get 9.3.0 from? Compiled it yourself? Any chance you
compiled with --enable-cassert or some such?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Lonni J Friedman
Date:
On Tue, Sep 17, 2013 at 3:47 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2013-09-17 09:19:29 -0700, Lonni J Friedman wrote:
>> I'm running a PostgreSQL 9.3.0 cluster (1 master with two streaming
>> replication hot standby slaves) on RHEL6-x86_64.  Yesterday I upgraded
>> from 9.2.4 to 9.3.0, and since the upgrade I'm seeing a significant
>> performance degradation.  PostgreSQL simply feels slower.  Nothing
>> other than the version of PostgreSQL changed yesterday.  I used
>> pg_upgrade to perform the upgrade, and ran the generated
>> analyze_new_cluster.sh immediately afterwards, which completed
>> successfully.
>
> Where did you get 9.3.0 from? Compiled it yourself? Any chance you
> compiled with --enable-cassert or some such?

Directly from http://yum.postgresql.org.  So unless the RPMs on there
are built weird/wrong, I don't think that's the problem.
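
(If it helps rule that out, here's a quick check, sketched on the
assumption that the client tools match the server build: pg_config
reports the configure flags of the installed build, and
debug_assertions reports whether assertions are active:)

pg_config --configure | grep -c 'enable-cassert'    # prints 0 if not built with asserts
psql -At -c 'SHOW debug_assertions;'                # "off" on a normal production build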


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Kevin Grittner
Date:
Lonni J Friedman <netllama@gmail.com> wrote:

> top shows over 90% of the load is in sys space.  vmstat output
> seems to suggest that it's CPU bound (or bouncing back & forth):

Can you run `perf top` during an episode and see what kernel
functions are using all that CPU?

This looks similar to cases I've seen of THP defrag going wild.
Did the OS version or configuration change?  Did the PostgreSQL
memory settings (like shared_buffers) change?
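
(If it is THP, something like this should show it; the sysfs paths
below are the mainline ones, and some RHEL6 kernels use
/sys/kernel/mm/redhat_transparent_hugepage instead:)

perf top                                           # watch for compaction/migration symbols
cat /sys/kernel/mm/transparent_hugepage/enabled    # active value is shown in [brackets]
cat /sys/kernel/mm/transparent_hugepage/defrag
egrep 'thp' /proc/vmstat                           # THP fault/collapse/split counters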

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Lonni J Friedman
Date:
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Lonni J Friedman <netllama@gmail.com> wrote:
>
>> top shows over 90% of the load is in sys space.  vmstat output
>> seems to suggest that it's CPU bound (or bouncing back & forth):
>
> Can you run `perf top` during an episode and see what kernel
> functions are using all that CPU?

Oddly, the problem went away on its own yesterday just after 4PM, and
performance has remained 'normal' since that time.  I changed
absolutely nothing.  If/when it returns, I'll certainly capture that
output.

>
> This looks similar to cases I've seen of THP defrag going wild.
> Did the OS version or configuration change?  Did the PostgreSQL
> memory settings (like shared_buffers) change?

Nothing changed other than the version of postgres.  I re-used the
same postgresql.conf that was in place when running 9.2.x.

Anyway, here are the current THP related settings on the server:
[root@cuda-db7 ~]# grep AnonHugePages /proc/meminfo
AnonHugePages:    548864 kB
[root@cuda-db7 ~]# egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 272
thp_fault_alloc 129173889
thp_fault_fallback 17462551
thp_collapse_alloc 148437
thp_collapse_alloc_failed 15143
thp_split 242


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Lonni J Friedman
Date:
On Wed, Sep 18, 2013 at 2:02 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Lonni J Friedman <netllama@gmail.com> wrote:
>
>> top shows over 90% of the load is in sys space.  vmstat output
>> seems to suggest that it's CPU bound (or bouncing back & forth):
>
> Can you run `perf top` during an episode and see what kernel
> functions are using all that CPU?

I take back what I said earlier.  While the master is currently back
to normal performance, the two hot standby slaves are still churning
something awful.

If I run 'perf top' on either slave, after a few seconds, these are
consistently the top three in the list:
 84.57%  [kernel]               [k] _spin_lock_irqsave
  6.21%  [unknown]              [.] 0x0000000000659f60
  4.69%  [kernel]               [k] compaction_alloc

>
> This looks similar to cases I've seen of THP defrag going wild.
> Did the OS version or configuration change?  Did the PostgreSQL
> memory settings (like shared_buffers) change?

I think you're onto something here with respect to THP defrag going
wild.  I set /sys/kernel/mm/transparent_hugepage/defrag to 'never' and
immediately the load dropped on both slaves from over 5.00 to under
1.00.
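
(Concretely, the change was just this; the rc.local line is one way,
not necessarily the best, to make it survive a reboot:)

echo never > /sys/kernel/mm/transparent_hugepage/defrag    # takes effect immediately
cat /sys/kernel/mm/transparent_hugepage/defrag             # active value shown in [brackets]
echo 'echo never > /sys/kernel/mm/transparent_hugepage/defrag' >> /etc/rc.local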

So this raises the question: is this a kernel bug, or is there some
other solution to the problem?
Also, it seems weird that the problem didn't happen until I switched from
9.2 to 9.3.  Is it possible this is somehow related to the change from
using SysV shared memory to using POSIX shared memory and mmap for
memory management?


Re: upgrade from 9.2.x to 9.3 causes significant performance degradation

From
Jeff Janes
Date:
On Wed, Sep 18, 2013 at 10:30 AM, Lonni J Friedman <netllama@gmail.com> wrote:

>> This looks similar to cases I've seen of THP defrag going wild.
>> Did the OS version or configuration change?  Did the PostgreSQL
>> memory settings (like shared_buffers) change?
>
> I think you're onto something here with respect to THP defrag going
> wild.  I set /sys/kernel/mm/transparent_hugepage/defrag to 'never' and
> immediately the load dropped on both slaves from over 5.00 to under
> 1.00.
>
> So this raises the question: is this a kernel bug, or is there some
> other solution to the problem?
> Also, it seems weird that the problem didn't happen until I switched from
> 9.2 to 9.3.  Is it possible this is somehow related to the change from
> using SysV shared memory to using POSIX shared memory and mmap for
> memory management?

I would guess that it is probably a kernel bug which gets efficiently
exercised by the SysV-to-POSIX shared memory change.
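
(One way to see the change, sketched from my understanding rather than
anything authoritative: 9.3 keeps only a small SysV stub and maps the
bulk of shared_buffers as anonymous memory via mmap, and anonymous
memory is exactly what THP/khugepaged compaction operates on:)

ipcs -m                             # 9.2: a shared_buffers-sized SysV segment;
                                    # 9.3: only a small bookkeeping stub
grep AnonHugePages /proc/meminfo    # THP only backs anonymous memory, which
                                    # on 9.3 includes the shared buffers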

Cheers,

Jeff