Re: PG performance issues related to storage I/O waits

From: Tomas Vondra
Subject: Re: PG performance issues related to storage I/O waits
Date:
Msg-id: 51FD6A41.8020607@fuzzy.cz
In reply to: PG performance issues related to storage I/O waits  (Tasos Petalas <tasos.petalas@upstreamsystems.com>)
List: pgsql-performance
Hi,

On 1.8.2013 21:47, Tasos Petalas wrote:
> I attach postgresql.conf, sar
> output and server/os/disk info
>
> Will update you with iotop outputs shortly.

So, have you found anything interesting using iotop?

> On Wed, Jul 31, 2013 at 3:40 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>> On 31 July 2013, 13:58, Tasos Petalas wrote:
>>> Hello team,
>>>
>>> We have serious performance issues with a production EDB 9.2AS
>>> installation
>>
>> Are you sure you've sent this to the right list? Because this is a
>> community mailing list for PostgreSQL, not for EDB. If you have a
>> support contract with EDB it's probably better to ask them directly
>> (e.g. they might give you advice about some custom features not
>> available in vanilla PostgreSQL).
>
> We do have a support contract, which did not provide any valuable
> feedback so far. As a person coming from the community world I
> believe I can get answers here.

You can certainly get help here, although probably only for plain
PostgreSQL, not for the features available only in EDB.

>>> The issue is mostly related to storage I/O bottlenecks during
>>> peak hours and we are looking for tunables on any level that
>>> could reduce the I/O spikes on SAN and improve overall DB
>>> performance.
>>
>> So is that a dedicated DWH machine, and PostgreSQL is responsible
>> for most of the I/O load? Which processes are doing that? Backends
>> handling queries or some background processes (say, checkpoints)?
>> Is that random or sequential I/O, reads or writes, ...?
>
> The system is a heavy OLTP system. Bottlenecks are mostly related to
> I/O writes (many small chunks). Machine is only hosting postgres.

That however still doesn't say which processes are responsible for it.
Is that the background writer, backends running queries, or something
else? iotop should give you an answer to this (or at least a hint).
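
If you haven't collected it yet, something along these lines run during
the peak window should show which processes do the I/O (needs root, and
the exact option names may differ a bit between iotop versions):

  iotop -o -b -t -d 10 -n 360 > iotop-peak.log
  # -o  only processes actually doing I/O
  # -b  batch (non-interactive) mode, -t adds timestamps
  # -d 10 -n 360  10-second samples for roughly an hour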

>> How much I/O are we talking about? Could it be that the SAN is
>> overloaded by someone else (in case it's not dedicated to the
>> database)?
>
> Check SAR results (pg_data resides on dev 253-2 (RAID-10),
> pg_archives on 253-3 (RAID-5)).

What is pg_archives? Also, RAID-5 is generally a poor choice for
write-intensive workloads.

Also, how are these volumes defined? Do they use distinct sets of disks?
How many disks are used for each volume?

>> It might turn out that the most effective solution is tuning the
>> queries that are responsible for the I/O activity.
>
> SAN is almost dedicated to the host running postgres.

And? How does that contradict my suggestion to tune the queries? Every
piece of hardware has its bottlenecks, and there's always a query that
can hit them. In those cases the best solution is often tuning the query.

>>> Our storage array consists of 16 disks in RAID-10 topology
>>> (device 253-2 on our OS configuration). We are also using RAID-5
>>>  for archive_log storage (also presented by SAN to the same
>>> machine - device 253-3)
>>
>> I have no clue what device 253-3 is, but I assume you're using SAS
>>  disks.
>
> Yes we are using 15K SAS disks in RAID 10. (253-2 dev refers to sar
> output for disks)

OK, so the pg_archives is probably for xlog archive, right?

>>> We have set synchronous_commit to off but since almost all of
>>> application queries are using prepared statements we don't get
>>> any real benefit.
>>
>> Ummmm, how is this related? AFAIK those are rather orthogonal
>> features, i.e. prepared statements should benefit from
>> synchronous_commit=off just like any other queries.
>
> It is not prepared statements. It is distributed transactions
> (queries inside a PREPARE TRANSACTION block).
>
> "Certain utility commands, for instance DROP TABLE, are forced to
> commit synchronously regardless of the setting of synchronous_commit.
> This is to ensure consistency between the server's file system and
> the logical state of the database. The commands supporting two-phase
> commit, such as PREPARE TRANSACTION, are also always synchronous."
>
> Taken from
> http://www.postgresql.org/docs/9.2/static/wal-async-commit.html

Well, then please use the proper term - prepared transactions and
prepared statements are two very different things.

But yes, if you're using prepared transactions (for 2PC), then
synchronous_commit=off is not going to improve anything, as the commit
simply has to be synchronous.
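
Just to illustrate the difference (the table and transaction names are
made up for the example, and PREPARE TRANSACTION only works with
max_prepared_transactions > 0):

  -- prepared STATEMENT: a pre-parsed query; transactions using it
  -- still benefit from synchronous_commit = off
  PREPARE get_acct (int) AS SELECT * FROM accounts WHERE id = $1;
  EXECUTE get_acct(1);

  -- prepared TRANSACTION: first phase of two-phase commit; the
  -- PREPARE TRANSACTION step is always flushed to WAL synchronously
  BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  PREPARE TRANSACTION 'tx_1';
  COMMIT PREPARED 'tx_1';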

>>> We are using VMware, VMFS and LVM, so we need your feedback on
>>> any kind of tunable that could remove load from storage during
>>> peak hours (FYI application peak hours are 13:00-23:00 UTC;
>>> during the night (04:00-06:00 UTC) there is some heavy reporting
>>> activity + backups). Archive logs are rsync-ed to a remote backup
>>> server every 20 minutes.
>>>
>>> Also please advise on any postgresql.conf modification that could
>>> significantly affect storage load (WAL-checkpoint configuration
>>> etc.) (we have not tried to move pg_xlog to a separate LUN since
>>> this is not an option - any other LUN would be using the same
>>> storage pool as the rest of the /pgdata files). We had some
>>> issues in the past with the autovacuum daemon failing to work
>>> efficiently under high load, so we have already applied your
>>> instructions for a more aggressive autovacuum policy (changes
>>> already applied to postgresql.conf)
>>>
>>> Let me know if you want me to attach all the usual info for
>>> tickets regarding (OS, disks, PG conf, etc) plus the sar output
>>> and server logs from the last 3 days (24,25,26 June).
>>
>> Well, we can't really help you unless you give us this, so yes -
>> attach this info. And please try to identify what is actually
>> causing most I/O activity (e.g. using "iotop").
>
> SAR outputs, postgresql.conf, other os/system h/w info attached.

I've checked the conf, and I think you should really consider increasing
checkpoint_segments - it's set to 3 (i.e. just 48MB of WAL between
checkpoints), but I think something like 32 (= 512MB) or even more would
be more appropriate.

I see you've enabled log_checkpoints - can you check in your logs how
often the checkpoints happen?
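
With log_checkpoints on, the server log should contain "checkpoint
starting: ..." / "checkpoint complete: ..." lines, so something like
this is enough (the path depends on your log_directory / log_filename
settings):

  grep "checkpoint" /path/to/pg_log/postgresql-*.log | tail -n 40

Checkpoints triggered by running out of WAL segments are logged as
"checkpoint starting: xlog", the timed ones as "checkpoint starting:
time".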

Also, can you check pg_stat_bgwriter view? I'd bet the value in
checkpoints_timed is very low, compared to checkpoints_req. Or even
better, get the values from this view before / after running the batch jobs.
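
Something like this should do (all the columns are available in 9.2):

  SELECT checkpoints_timed, checkpoints_req,
         buffers_checkpoint, buffers_backend, buffers_backend_fsync,
         stats_reset
    FROM pg_stat_bgwriter;

Run it before and after the batch window and compare the numbers;
stats_reset tells you how long the counters have been accumulating.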

Anyway, I don't think writes are the main problem here - see this:

                  DEV       tps  rd_sec/s  wr_sec/s   await   %util
04:00:01     dev253-2    176.56    134.07   1378.08    3.70    3.53
04:10:01     dev253-2    895.22  11503.99   6735.08   16.63    8.24
04:20:01     dev253-2    455.35  25523.80   1362.37    2.38   16.81
04:30:01     dev253-2    967.29  95471.88   4193.50    6.70   37.44
04:40:01     dev253-2    643.31  80754.86   2456.40    3.35   29.70
04:50:01     dev253-2    526.35  84990.05   1323.28    2.07   29.41
05:00:01     dev253-2    652.68  73192.18   1297.20    1.89   28.51
05:10:01     dev253-2   1256.31  34786.32   5840.08    9.25   53.08
05:20:01     dev253-2    549.84  14530.45   3522.85    8.12    9.89
05:30:01     dev253-2   1363.27 170743.78   5490.38    7.53   59.75
05:40:01     dev253-2    978.88 180199.97   1796.90    2.54   74.08
05:50:01     dev253-2   1690.10 166467.91   8013.10   35.45   66.32
06:00:01     dev253-2   2441.94 111316.65  15245.05   34.90   41.78

it's a slightly modified sar output for the main data directory (on
253-2). It clearly shows you're doing ~50MB/s of (random) reads compared
to less than 5MB/s of writes (assuming a sector is 512B).
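
For example, at 04:30 that's 95471.88 rd_sec/s * 512 B ~ 49 MB/s of
reads, compared to 4193.50 wr_sec/s * 512 B ~ 2 MB/s of writes.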

There's almost no activity on the pg_archives (253-3) device:

00:00:01          DEV       tps  rd_sec/s  wr_sec/s   await   %util
04:00:01     dev253-3     20.88      0.01    167.00   14.10    0.02
04:10:01     dev253-3    211.06      0.05   1688.43   43.61    0.12
04:20:01     dev253-3     14.30      0.00    114.40    9.95    0.01
04:30:01     dev253-3    112.78      0.45    901.75   17.81    0.06
04:40:01     dev253-3     14.11      0.00    112.92   10.66    0.01
04:50:01     dev253-3      7.39     56.94     56.85   10.91    0.04
05:00:01     dev253-3     14.21      0.00    113.67   10.92    0.01
05:10:01     dev253-3      7.05      0.26     56.15   17.03    0.02
05:20:01     dev253-3     28.38     18.20    208.87    8.68    0.29
05:30:01     dev253-3     41.71      0.03    333.63   14.70    0.03
05:40:01     dev253-3      6.95      0.00     55.62   10.39    0.00
05:50:01     dev253-3    105.36    386.44    830.83    9.62    0.19
06:00:01     dev253-3     13.92      0.01    111.34   10.41    0.01

In the afternoon it's a different story - for 253-2 it looks like this:

                  DEV       tps  rd_sec/s  wr_sec/s   await    %util
15:50:01     dev253-2   4742.91  33828.98  29156.17   84.84   105.14
16:00:01     dev253-2   2781.05  12737.41  18878.52   19.24    80.53
16:10:01     dev253-2   3661.51  20950.64  23758.96   36.86    99.03
16:20:01     dev253-2   5011.45  32454.33  31895.05   72.75   102.38
16:30:01     dev253-2   2638.08  14661.23  17853.16   25.24    75.64
16:40:01     dev253-2   1988.95   5764.73  14190.12   45.05    58.80
16:50:01     dev253-2   2185.15  88296.73  11806.38    7.46    84.37
17:00:01     dev253-2   2031.19  12835.56  12997.34    8.90    82.62
17:10:01     dev253-2   4009.24  34288.71  23974.92   38.07   103.01
17:20:01     dev253-2   3605.86  26107.83  22457.41   45.76    90.90
17:30:01     dev253-2   2550.47   7496.85  18267.07   19.10    65.87

So this is about 50:50 reads and writes, and this is also the time when
there's some measurable activity on the 253-3 device:

00:00:01          DEV       tps  rd_sec/s  wr_sec/s   await    %util
15:50:01     dev253-3   1700.97   9739.48  13249.75   22.63     8.53
16:00:01     dev253-3    807.44    512.95   6439.17   15.21     0.82
16:10:01     dev253-3   1236.72     22.92   9892.26   28.74     0.95
16:20:01     dev253-3   1709.15      0.52  13672.70   40.89     1.69
16:30:01     dev253-3    919.26   8217.60   7051.60   20.40    11.74
16:40:01     dev253-3    601.66      0.37   4812.94   18.99     0.39
16:50:01     dev253-3    476.40      0.42   3810.95   10.02     0.28
17:00:01     dev253-3    636.03      0.15   5088.08   11.01     0.35
17:10:01     dev253-3   1259.55    165.64  10069.65   15.18     1.01
17:20:01     dev253-3   1194.10      0.29   9552.49   26.11     0.94
17:30:01     dev253-3    785.40   2000.52   6201.21   33.12     3.06

Still, this is pretty low write activity, and it's sequential.

What I think you could/should do:

* increase checkpoint_segments to 64 (or something along those lines -
  see the snippet after this list)

* move pg_xlog to a separate device (not simply a volume on the SAN,
  sharing disks with the other volumes - that won't give you anything)
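
For the first point, a minimal postgresql.conf sketch (the value is just
a starting point, and IIRC a config reload is enough, no restart):

  checkpoint_segments = 64   # was 3; 64 x 16MB = ~1GB of WAL between checkpoints
  log_checkpoints = on       # you already have this - keep it to watch the effect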

I'd expect these changes to improve the afternoon peak, as it's doing
about 50% writes. However I would not expect this to improve the morning
peak, because that's doing a lot of reads (not writes).

The trick here is that your database is ~300GB - you're doing a lot of
seeks across the whole database, and it's way more than the RAM. So it
has to actually access the disks. 15k disks handle ~200 IOPS each, and
combining them into a RAID-10 array generally gives you roughly

    (n/2) * (IOPS for a single disk)

I don't know exactly how your volumes are defined on the SAN, but
assuming RAID-10 on 12 drives, that gives you ~1200 IOPS.

This is the bottleneck you're hitting, and there's not much you can do
about it, short of:

  * getting better storage (giving you more seeks) - say SSD-like disks
  * improving the application/queries/schema to minimize the seeks

And so on.

Tomas

