Discussion: Re: Perplexing, regular decline in performance

Re: Perplexing, regular decline in performance

From: Benjamin Scherrey
Date:
Have you done a VACUUM ANALYZE FULL on your database? This needs to be done periodically to inform the server of the statistics of how the data and relations are distributed across the database. Without this, bad assumptions by the planner can cause performance to degrade. Also, if you are using the default settings in postgresql.conf, understand that those are established to use the absolute minimum amount of resources, which means not taking advantage of memory or CPUs available in your environment that would make the database server more performant.
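
For reference, a minimal sketch of the commands being suggested here (the table name is a placeholder; the FULL form rewrites the table under an ACCESS EXCLUSIVE lock):

    VACUUM (VERBOSE, ANALYZE);              -- reclaim dead rows and refresh planner statistics, no exclusive lock
    VACUUM (FULL, ANALYZE) some_big_table;  -- full rewrite of one table; takes an ACCESS EXCLUSIVE lock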

Please investigate these and then report back any details of what you've done to try to improve performance.

  best regards,


On Tue, Jun 25, 2019 at 10:49 PM Hugh Ranalli <hugh@whtc.ca> wrote:
I'm hoping people can help me figure out where to look to solve an odd PostgreSQL performance problem. 

A bit of background: We have a client with a database of approximately 450 GB, which has a couple of tables storing large amounts of text, including full HTML pages from the Internet. Last fall, they began experiencing a dramatic, exponential decline in performance. We track certain query times, so we know how much time is being spent in calls to the database for these functions. When this began, the average times went from approximately 200 ms to 400 ms, then climbed rapidly each day until, within 4 days, they reached 900 ms, figures we had never seen before, with no appreciable change in usage. At that point we restarted the database server and times returned to the 400 ms range, but never back to their long-running original levels. From then on, we had to restart the database (originally the server, but eventually just the database process) every 3-4 days, otherwise the application became unusable.

As they were still on PostgreSQL 8.2, we persuaded them to finally undertake our long-standing recommendation to upgrade, as there was no possibility of support on that platform. That upgrade to 11.2 was completed successfully in mid-May, and although times have not returned to their original levels (they now average approximately 250 ms), the application overall seems much more responsive and faster (the application servers were not changed, other than minor adjustments, such as full-text search and explicit casts, to conform to PostgreSQL 11's requirements).

What we continued to notice was a milder but still definite trend of increasing query times over the course of each week, from the mid-to-high 200 ms range to the high-300 to low-400 ms range. Some years ago, someone had noticed that as the number of rows with populated "raw_page" columns in a particular table grew, performance would decline. They wrote a script that, once a week, locks the table, deletes the processed large columns (they are not needed after processing), copies the remaining data to a backup table, truncates the original table, and then copies the data back. When this script runs we see an immediate improvement, from 380 ms in the hour before the drop to 250 ms in the hour of the drop. As rows with these populated columns are added during the course of the week, performance declines steadily until the next week's cleaning operation. Each week the performance improvement is clear and significant.
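
A rough sketch of what such a weekly cleanup could look like; the table and column names (other than raw_page) are hypothetical, not the client's actual schema:

    BEGIN;
    LOCK TABLE page_queue IN ACCESS EXCLUSIVE MODE;
    -- drop the large column for rows that have already been processed
    CREATE TEMPORARY TABLE page_queue_keep AS
        SELECT id, url, processed,
               CASE WHEN processed THEN NULL ELSE raw_page END AS raw_page
        FROM page_queue;
    TRUNCATE page_queue;
    INSERT INTO page_queue (id, url, processed, raw_page)
        SELECT id, url, processed, raw_page
        FROM page_queue_keep;
    COMMIT;

Unlike DELETE, TRUNCATE returns the freed space to the operating system right away.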

What is perplexing (and I have triple-checked) is that this table is *not* referenced in any way in the queries that we time (it is referenced by ongoing administrative and processing queries). The operation that cleans it frees up approximately 15-20 GB of space each week. Our system monitoring shows this change in free disk space, but this is 20 GB out of approximately 300 GB of free space (free space is just under 40% of the volume size), so disk space does not seem to be the issue. The table in question is about 21 GB in size at its largest, with about 20 GB of that in TOAST data.

Even odder, the queries we time *do* reference a much larger table, which contains very similar data and multiple such columns. It is 355 GB in size, with 318 GB in TOAST data. It grows continually, with no cleaning.
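
For anyone who wants to see how these figures break down, one way to split a table into heap and TOAST portions (the relation names below are placeholders):

    SELECT c.relname,
           pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
           pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size,
           pg_size_pretty(pg_total_relation_size(c.oid))     AS total_size
    FROM pg_class c
    WHERE c.relname IN ('weekly_cleaned_table', 'large_uncleaned_table');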

If anyone has any suggestions as to what sort of statistics to look at, or why this would be happening, they would be greatly appreciated.

Thanks in advance,
Hugh

--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh@whtc.ca
c: +01-416-994-7957

Re: Perplexing, regular decline in performance

From: Justin Pryzby
Date:
On Tue, Jun 25, 2019 at 10:55:22PM +0700, Benjamin Scherrey wrote:
> Have you done a VACUUM ANALYZE FULL on your database? This needs to be done
> periodically to inform the server of the statistics of how the data and
> relations are distributed across the database.

I think this is wrong.

VACUUM and ANALYZE are good, but normally happen automatically by autovacuum.

VACUUM FULL takes an exclusive lock on the table, and rewrites its data and
indices from scratch.  It's not normally necessary at all.  It's probably most
useful for recovering from a badly-bloated table if autovacuum didn't run often
enough, in which case the permanent solution is to change autovacuum settings
to be more aggressive.
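
For illustration only, the sort of knobs "more aggressive" usually means; the values below are examples, not recommendations for this workload:

    autovacuum_vacuum_scale_factor = 0.05    # default 0.2: vacuum once roughly 5% of a table's rows are dead
    autovacuum_analyze_scale_factor = 0.02   # default 0.1
    autovacuum_vacuum_cost_limit = 1000      # lets each autovacuum run do more work before sleeping
    autovacuum_naptime = 30s                 # default 1min

The same storage parameters can also be set per table, e.g. ALTER TABLE ... SET (autovacuum_vacuum_scale_factor = 0.01).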

Don't confuse it with VACUUM FREEZE, which doesn't require an exclusive lock and
is normally not necessary if autovacuum is working properly.

Justin



Re: Perplexing, regular decline in performance

From: Hugh Ranalli
Date:
On Tue, 25 Jun 2019 at 11:55, Benjamin Scherrey <scherrey@proteus-tech.com> wrote:
Have you done a VACUUM ANALYZE FULL on your database? This needs to be done periodically to inform the server of the statistics of how the data and relations are distributed across the database. Without this, bad assumptions by the planner can cause performance to degrade.
 
Autovacuum is enabled. In addition, we had problems with autovacuum running reliably in 8.2, so we are still running a nightly script that runs VACUUM ANALYZE on the complete database. As for VACUUM ANALYZE FULL, the database underwent a full dump and reload, which, as I understand it, would have rebuilt the indexes, followed by an ANALYZE to update the planner. So I'm not sure a VACUUM ANALYZE FULL would have much effect. I'm also not sure how it bears on the problem stated here, where the planner shouldn't even be looking at this table in the queries we are timing.
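
In case it's useful, this is the sort of check we can run to confirm autovacuum is keeping up (standard pg_stat_user_tables columns; the LIMIT is just to keep the output short):

    SELECT relname, n_live_tup, n_dead_tup,
           last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;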

Also, if you are using the default settings in postgresql.conf, understand that those are established to use the absolute minimum amount of resources, which means not taking advantage of memory or CPUs available in your environment that would make the database server more performant.
 
No, we attempted to tune these using https://pgtune.leopard.in.ua. The following values are from our install script (which is why they don't look exactly like their .conf versions). And, as someone else asked, transparent huge pages are enabled:
# DB Version: 11
# OS Type: linux
# DB Type: web
# Total Memory (RAM): 128 GB
# CPUs = threads per core * cores per socket * sockets
# CPUs num: 256
# Connections num: 250
# Data Storage: ssd

# Set via sysctl
# 64 GB in 4096 byte pages on our 128GB production system
shmall = 15777216
# 48 GB on our 128GB production system
shmmax = 51539607552

# Huge Pages
# Set via sysctl
huge-pages-alloc = 0

shared-buffers = 32GB
work-mem = 1024kB
maintenance-work-mem = 2GB
max-stack-depth = 4MB
effective-io-concurrency = 200
max-parallel-workers-per-gather = 128
max-parallel-workers = 256

#
# postgresql-conf-archive
#
wal-buffers = 16MB
min-wal-size = 1GB
max-wal-size = 2GB
checkpoint-completion-target = 0.7
archive-mode = on
archive-timeout = 900      

#
# postgresql-conf-query
#
# 75% of production memory
effective-cache-size = 96GB
# SSD drives
random-page-cost = 1.1
default-statistics-target = 100
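
If it helps, the values actually in effect (as opposed to what the install script writes) can be checked from inside the database; the parameter list below is just the subset mentioned above:

    SELECT name, setting, unit, source
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
                   'effective_cache_size', 'max_parallel_workers',
                   'max_parallel_workers_per_gather', 'huge_pages');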

I'll be providing further details in reply to another message in the thread.

Thanks!