Discussion: [HACKERS] autovacuum can't keep up, bloat just continues to rise


[HACKERS] autovacuum can't keep up, bloat just continues to rise

From: "Joshua D. Drake"
Date:
Hello,

At PGConf US Philly last week I was talking with Jim and Jan about 
performance. One of the items that came up is that PostgreSQL can't run 
full throttle for long periods of time. The long and short is that no 
matter what, autovacuum can't keep up. This is what I have done:

Machine:

16vCPU
59G Memory
10G SSD (/)
500G SSD /srv/main/9.6 (PGDATA): 240 MB/s sustained with 15k IOPS    * Yes, we really got 240 MB/s sustained performance

I used BenchmarkSQL, which is a TPC-C benchmark similar to pgbench but 
supposedly more thorough.

https://sourceforge.net/projects/benchmarksql/

PostgreSQL 9.6 on Ubuntu 16.04 x64.

postgresql.conf:

max_connections: 1000 (just to keep it out of the way)
shared_buffers: 32G (Awesome work Haas)
work_mem: 32M
maintenance_work_mem: 2G
effective_io_concurrency: 1
* Before anybody suggests increasing this: on GCE, across over a dozen tests, 
anything but disabling this appears to be a performance hit of ~10% (I can 
reproduce those tests on another thread if you like).

synchronous_commit: off
checkpoint_timeout: 60min
max_wal_size: 5G
random_page_cost: 1
effective_cache_size: 32GB  * This probably should be more like 50GB, but still
autovacuum_max_workers: 12  * One for each table plus a couple for system tables
autovacuum_vacuum_scale_factor: 0.1
autovacuum_cost_delay: 5ms
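
The effective autovacuum-related settings for each run can be dumped straight
from the server; a minimal sketch (this is essentially the listing that shows
up later in the thread):

SELECT name, setting
FROM pg_settings
WHERE name LIKE 'autovacuum%'
   OR name IN ('max_wal_size', 'checkpoint_timeout');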

Here are the benchmarksql settings for all 4 runs:

17:07:54,268 [main] INFO   jTPCC : Term-00, warehouses=500
17:07:54,269 [main] INFO   jTPCC : Term-00, terminals=128
17:07:54,272 [main] INFO   jTPCC : Term-00, runTxnsPerTerminal=100000
17:07:54,273 [main] INFO   jTPCC : Term-00, limitTxnsPerMin=300000
17:07:54,273 [main] INFO   jTPCC : Term-00, terminalWarehouseFixed=false
17:07:54,274 [main] INFO   jTPCC : Term-00,
17:07:54,274 [main] INFO   jTPCC : Term-00, newOrderWeight=45
17:07:54,274 [main] INFO   jTPCC : Term-00, paymentWeight=43
17:07:54,274 [main] INFO   jTPCC : Term-00, orderStatusWeight=4
17:07:54,275 [main] INFO   jTPCC : Term-00, deliveryWeight=4
17:07:54,275 [main] INFO   jTPCC : Term-00, stockLevelWeight=4

For run 0, I started with:

vacuumdb -U postgres -fz;./runBenchmark.sh my_postgres.properties

And then for each subsequent run, I just ran the benchmark without the 
vacuum full so that PostgreSQL could prove us wrong. It didn't. Here is 
the breakdown of the results:

RUN    START DISK SIZE (GB)    END DISK SIZE (GB)    TPS/Terminal
0      54                      78                    868.6796875
1      78                      91                    852.4765625
2      91                      103                   741.4609375
3      103                     116                   686.125
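
The disk numbers above are from du; roughly the same picture can be pulled
from inside the database between runs. A sketch, not part of the original
test harness:

SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size;
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 5;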

The good news is, PostgreSQL is not doing half bad against 128 
connections with only 16vCPU. The bad news is we more than doubled our 
disk size without getting reuse or bloat under control. The concern here 
is that under heavy write loads that are persistent, we will eventually 
bloat out and have to vacuum full, no matter what. I know that Jan has 
done some testing and the best he could get is something like 8 days 
before PostgreSQL became unusable (but don't quote me on that).

I am open to comments, suggestions, running multiple tests with 
different parameters or just leaving this in the archive for people to 
reference.

Thanks in advance,

JD


-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Peter Geoghegan
Date:
On Wed, Jul 19, 2017 at 3:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> The good news is, PostgreSQL is not doing half bad against 128 connections
> with only 16vCPU. The bad news is we more than doubled our disk size without
> getting reuse or bloat under control. The concern here is that under heavy
> write loads that are persistent, we will eventually bloat out and have to
> vacuum full, no matter what. I know that Jan has done some testing and the
> best he could get is something like 8 days before PostgreSQL became unusable
> (but don't quote me on that).
>
> I am open to comments, suggestions, running multiple tests with different
> parameters or just leaving this in the archive for people to reference.

Did you see my blogpost on planet PostgreSQL from last night?

https://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html

Perhaps you could use my query to instrument an interesting index, to
see what that turns up. I would really like to get a better sense of
how often and to what extent index bloat is a problem that VACUUM is
just inherently incapable of keeping under control over time. The
timeline for performance to degrade with real production workloads is
very interesting to me. It's really hard to simulate certain types of
problems that you will see frequently in production.
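
As a rough alternative sketch (not the query from the blog post), pgstattuple's
pgstatindex() reports leaf density and fragmentation for a single index; the
index name here is simply the one that comes up later in this thread:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT leaf_pages, deleted_pages, avg_leaf_density, leaf_fragmentation
FROM pgstatindex('bmsql_order_line_pkey');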

Index bloat is a general problem that B-Trees have in all other major
systems, but I think that PostgreSQL has a tendency to allow indexes
to become progressively more bloated over time, in a way that it often
can never recover from [1]. This may be a particular problem with
unique indexes, where many physical duplicates accumulate in pages:
duplicates that are theoretically reclaimable, but that, due to how the
keyspace is split up, will never actually be reclaimed [2].

[1] postgr.es/m/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1NGQ@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Key_normalization#VACUUM_and_nbtree_page_deletion
-- 
Peter Geoghegan



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> At PGConf US Philly last week I was talking with Jim and Jan about 
> performance. One of the items that came up is that PostgreSQL can't run 
> full throttle for long periods of time. The long and short is that no 
> matter what, autovacuum can't keep up. This is what I have done:

Try reducing autovacuum_vacuum_cost_delay more, and/or increasing
autovacuum_vacuum_cost_limit.
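
A minimal sketch of trying that without a restart (the specific values are
illustrative only, not a recommendation from this thread; both GUCs take
effect on reload):

ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();
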
        regards, tom lane



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Alvaro Herrera
Date:
Peter Geoghegan wrote:

> Index bloat is a general problem that B-Trees have in all other major
> systems, but I think that PostgreSQL has a tendency to allow indexes
> to become progressively more bloated over time, in a way that it often
> can never recover from [1].

Interesting assertion.  Many years ago I set out to implement btree page
merging[1] from ideas in a 1996 paper[2], though that work never saw the
light of day.  Maybe it can be valuable now.

Another thing to consider is indirect indexes, wherein you get less heap
bloat because more updates can become HOT.

If we also allow heap to be pruned from line pointers by walking indexes
to remove specific pointers, instead of requiring indexes to be scanned
whole for the removal, as proposed by Andres, we could get further
along.

[1] https://www.postgresql.org/message-id/20020912235429.4714071a.alvherre@atentus.com
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.9961

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Joe Conway
Date:
On 07/19/2017 03:29 PM, Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> At PGConf US Philly last week I was talking with Jim and Jan about
>> performance. One of the items that came up is that PostgreSQL can't run
>> full throttle for long periods of time. The long and short is that no
>> matter what, autovacuum can't keep up. This is what I have done:
>
> Try reducing autovacuum_vacuum_cost_delay more, and/or increasing
> autovacuum_vacuum_cost_limit.

I would try autovacuum_vacuum_cost_delay = 0, and for any tables > 1 million rows:
autovacuum_vacuum_scale_factor: 0
autovacuum_vacuum_threshold: 100000 (perhaps even smaller)
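
A sketch of what those per-table settings could look like, using one of the
big benchmarksql tables mentioned later in the thread as the example:

ALTER TABLE bmsql_order_line SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold    = 100000
);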

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > At PGConf US Philly last week I was talking with Jim and Jan about
> > performance. One of the items that came up is that PostgreSQL can't run
> > full throttle for long periods of time. The long and short is that no
> > matter what, autovacuum can't keep up. This is what I have done:
>
> Try reducing autovacuum_vacuum_cost_delay more, and/or increasing
> autovacuum_vacuum_cost_limit.

Or get rid of the cost delay entirely and let autovacuum actually go as
fast as it can when it's run.  The assertion that it can't keep up is
still plausible, but configuring autovacuum to sleep regularly and then
complaining that it's not able to keep up doesn't make sense.

Reducing the nap time might also be helpful if autovacuum is going as
fast as it can and it's able to clear a table in less than a minute.

There have been discussions on this list about parallel vacuum of a
particular table as well; to address this issue I'd encourage reviewing
those discussions and looking at writing a patch to implement that
feature as that would address the case where the table is large enough
that autovacuum simply can't get through all of it before the other
backends have used all space available and then substantially increased
the size of the relation (leading to vacuum on the table running for
longer).

Thanks!

Stephen

Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Peter Geoghegan
Date:
On Wed, Jul 19, 2017 at 3:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> Index bloat is a general problem that B-Trees have in all other major
>> systems, but I think that PostgreSQL has a tendency to allow indexes
>> to become progressively more bloated over time, in a way that it often
>> can never recover from [1].
>
> Interesting assertion.

I don't pretend to understand the complicated feedback loops that may
exist for workloads that are very reliant on pruning,
kill_prior_tuples cleanup, very aggressive vacuuming, etc. I'll just
say that it seems very reasonable to suppose that they exist.

My argument for the importance of index bloat to the more general
bloat problem is simple: any bloat that accumulates, that cannot be
cleaned up, will probably accumulate until it impacts performance
quite noticeably.

Index bloat may not seem as important as I suggest to some. The ideas
I'm talking about were received somewhat skeptically at pgCon, when I
brought them up informally. Maybe this is a more cogent explanation:
if your indexes were somehow magically never bloated, but the heap
could become just as bloated, then it might not matter that much
simply because the heap pages wouldn't be accessed by index scans.
Heap page accesses may demonstrably be the bottleneck today, without
that magic in place, but perhaps only because indexes point to the
bloat in the first place. It could be a bit like the situation with
bloating UNDO segments in Oracle; those are not the worst thing to
have to bloat. And, the kill_prior_tuples stuff is only as strong as
the weakest link (oldest snapshot).

> Many years ago I set to implement btree page
> merging[1] from ideas in a 1996 paper[2], though that work never saw the
> light of day.  Maybe it can be valuable now.

Lots of other database systems don't implement B-Tree page merging,
because it's hard to make work with techniques like Lehman & Yao
B-Link trees, and the average case space utilization still ends up
being good enough. It may be more worthwhile for us, though.
Apparently Kevin has some ideas here.

If I'm right about this index bloat issue, then I'd sooner tackle the
problem by preventing bloat in unique indexes in the first place, by
using some additional indirection, a topic that I've said plenty about
recently. I think that you can sometimes, though fairly rarely, see
cases that get *really* out of hand, but with things like in-database
queues, that have unusually aggressive update patterns where a great
many duplicates are generated in indexes [1]. Our handling of
duplicates in unique indexes [2] is surely a problem there.

> Another thing to consider is indirect indexes, wherein you get less heap
> bloat because more updates can become HOT.

I think that the stuff I'm talking about, having indirection for the
primary key and making sure unique indexes disallow even physical
duplicates, actually enables indirect indexes. Remember how I
pointed out issues with unique indexes and VACUUM when you presented
on it at the pgCon unconference? I think that those problems may be
solvable through centralizing everything by making duplicates within
leaf pages physically impossible for unique indexes (not just
logically impossible). One chokepoint for dealing with bloat cleanup
for every index is the entire point of your indirect index design, but
that requires that you actually have no ambiguity about what every
logical pointer (in every secondary index) points to.

> If we also allow heap to be pruned from line pointers by walking indexes
> to remove specific pointers, instead of requiring indexes to be scanned
> whole for the removal, as proposed by Andres, we could get further
> along.

Yeah. I talked to Robert about doing that just today. That seems very
worthwhile, no matter how it ends up being used (whether it's by
VACUUM, something closer to a synchronous deletion, or whatever).

[1] https://brandur.org/postgres-queues
[2] https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement
-- 
Peter Geoghegan



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Masahiko Sawada
Date:
On Thu, Jul 20, 2017 at 8:24 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> > At PGConf US Philly last week I was talking with Jim and Jan about
>> > performance. One of the items that came up is that PostgreSQL can't run
>> > full throttle for long periods of time. The long and short is that no
>> > matter what, autovacuum can't keep up. This is what I have done:
>>
>> Try reducing autovacuum_vacuum_cost_delay more, and/or increasing
>> autovacuum_vacuum_cost_limit.
>
> Or get rid of the cost delay entirely and let autovacuum actually go as
> fast as it can when it's run.  The assertion that it can't keep up is
> still plausible, but configuring autovacuum to sleep regularly and then
> complaining that it's not able to keep up doesn't make sense.
>
> Reducing the nap time might also be helpful if autovacuum is going as
> fast as it can and it's able to clear a table in less than a minute.
>
> There have been discussions on this list about parallel vacuum of a
> particular table as well; to address this issue I'd encourage reviewing
> those discussions and looking at writing a patch to implement that
> feature as that would address the case where the table is large enough
> that autovacuum simply can't get through all of it before the other
> backends have used all space available and then substantially increased
> the size of the relation (leading to vacuum on the table running for
> longer).

Yeah, the parallel vacuum of a particular table might help this issue
unless disk I/O is the bottleneck. I'm planning to work on that.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> My argument for the importance of index bloat to the more general
> bloat problem is simple: any bloat that accumulates, that cannot be
> cleaned up, will probably accumulate until it impacts performance
> quite noticeably.

But that just begs the question: *does* it accumulate indefinitely, or
does it eventually reach a more-or-less steady state?  The traditional
wisdom about btrees, for instance, is that no matter how full you pack
them to start with, the steady state is going to involve something like
1/3rd free space.  You can call that bloat if you want, but it's not
likely that you'll be able to reduce the number significantly without
paying exorbitant costs.

I'm not claiming that we don't have any problems, but I do think it's
important to draw a distinction between bloat and normal operating
overhead.
        regards, tom lane



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Peter Geoghegan
Date:
On Wed, Jul 19, 2017 at 7:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> My argument for the importance of index bloat to the more general
>> bloat problem is simple: any bloat that accumulates, that cannot be
>> cleaned up, will probably accumulate until it impacts performance
>> quite noticeably.
>
> But that just begs the question: *does* it accumulate indefinitely, or
> does it eventually reach a more-or-less steady state?

Yes, I believe it does reach a more-or-less steady state. It saturates
when there is a lot of contention, because then you actually can reuse
the bloat. If it didn't saturate, and instead became arbitrarily bad,
then we'd surely have heard about that before now.

The bloat is not entirely wasted, because it actually prevents you
from getting even more bloat in that part of the keyspace.

> The traditional
> wisdom about btrees, for instance, is that no matter how full you pack
> them to start with, the steady state is going to involve something like
> 1/3rd free space.  You can call that bloat if you want, but it's not
> likely that you'll be able to reduce the number significantly without
> paying exorbitant costs.

For the purposes of this discussion, I'm mostly talking about
duplicates within a page on a unique index. If the keyspace owned by
an int4 unique index page only covers 20 distinct values, it will only
ever cover 20 distinct values, now and forever, despite the fact that
there is room for about 400 (a 90/10 split leaves you with 366 items +
1 high key).
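
Roughly where that 366 comes from, as a back-of-the-envelope sketch (8 kB page,
24-byte page header, 16-byte btree special space, 4-byte line pointer plus a
16-byte int4 index tuple, 90% fillfactor):

SELECT floor((8192 - 24 - 16) * 0.90 / (4 + 16)) AS leaf_items_per_page;  -- ~366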

I don't know if I should really even call this bloat, since the term
is so overloaded, although this is what other database systems call
index bloat. I like to think of it as "damage to the keyspace",
although that terminology seems unlikely to catch on.

-- 
Peter Geoghegan



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Sokolov Yura
Date:
On 2017-07-20 05:52, Masahiko Sawada wrote:
> On Thu, Jul 20, 2017 at 8:24 AM, Stephen Frost <sfrost@snowman.net> 
> wrote:
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> > At PGConf US Philly last week I was talking with Jim and Jan about
>>> > performance. One of the items that came up is that PostgreSQL can't run
>>> > full throttle for long periods of time. The long and short is that no
>>> > matter what, autovacuum can't keep up. This is what I have done:
>>> 
>>> Try reducing autovacuum_vacuum_cost_delay more, and/or increasing
>>> autovacuum_vacuum_cost_limit.
>> 
>> Or get rid of the cost delay entirely and let autovacuum actually go 
>> as
>> fast as it can when it's run.  The assertion that it can't keep up is
>> still plausible, but configuring autovacuum to sleep regularly and 
>> then
>> complaining that it's not able to keep up doesn't make sense.
>> 
>> Reducing the nap time might also be helpful if autovacuum is going as
>> fast as it can and it's able to clear a table in less than a minute.
>> 
>> There have been discussions on this list about parallel vacuum of a
>> particular table as well; to address this issue I'd encourage 
>> reviewing
>> those discussions and looking at writing a patch to implement that
>> feature as that would address the case where the table is large enough
>> that autovacuum simply can't get through all of it before the other
>> backends have used all space available and then substantially 
>> increased
>> the size of the relation (leading to vacuum on the table running for
>> longer).
> 
> Yeah, the parallel vacuum of a particular table might help this issue
> unless disk I/O is the bottleneck. I'm planning to work on that.
> 
> Regards,
> 
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

I wrote two days ago about the vacuum ring buffer:
https://www.postgresql.org/message-id/8737e9bddb82501da1134f021bf4929a%40postgrespro.ru

Increasing vacuum's ring buffer to the size of the bulk-write ring reduces
autovacuum time by 3-10 times.
(For both the patched and unpatched versions I used a single non-default
setting, 'autovacuum_cost_delay=2ms'.)

This is a single-line change, and it improves things a lot.

With regards,
-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Stephen Frost
Date:
Greetings,

* Sokolov Yura (y.sokolov@postgrespro.ru) wrote:
> I wrote two days ago about the vacuum ring buffer:
> https://www.postgresql.org/message-id/8737e9bddb82501da1134f021bf4929a%40postgrespro.ru
>
> Increasing vacuum's ring buffer to the size of the bulk-write ring reduces
> autovacuum time by 3-10 times.
> (For both the patched and unpatched versions I used a single non-default
> setting, 'autovacuum_cost_delay=2ms'.)
>
> This is a single-line change, and it improves things a lot.

Right- when the database fits in the OS cache but not in shared_buffers.

I do agree that's a useful improvement to make based on your testing.

It's not clear off-hand how much that would improve this case, as
the database size appears to pretty quickly get beyond the OS memory
size (and only in the first test is the DB starting size less than
system memory to begin with).

Thanks!

Stephen

Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: "Joshua D. Drake"
Date:
On 07/20/2017 06:28 AM, Stephen Frost wrote:

> It's not clear off-hand how much that would improve this case, as
> the database size appears to pretty quickly get beyond the OS memory
> size (and only in the first test is the DB starting size less than
> system memory to begin with).

FYI,

I will be posting new numbers in a few hours. I had been planning on 
posting this last night but... KDE.

JD


-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Claudio Freire
Date:
On Thu, Jul 20, 2017 at 12:08 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>> The traditional
>> wisdom about btrees, for instance, is that no matter how full you pack
>> them to start with, the steady state is going to involve something like
>> 1/3rd free space.  You can call that bloat if you want, but it's not
>> likely that you'll be able to reduce the number significantly without
>> paying exorbitant costs.
>
> For the purposes of this discussion, I'm mostly talking about
> duplicates within a page on a unique index. If the keyspace owned by
> an int4 unique index page only covers 20 distinct values, it will only
> ever cover 20 distinct values, now and forever, despite the fact that
> there is room for about 400 (a 90/10 split leaves you with 366 items +
> 1 high key).

Microvacuum could also help.

If during a scan you find pointers that point to dead (in vacuum terms)
tuples, the pointers in the index could be deleted. That could be done
during insert into unique indexes before a split, to avoid the split.

Chances are, if there are duplicates, at least a few of them will be dead.



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Jeff Janes
Date:
On Thu, Jul 20, 2017 at 6:28 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> * Sokolov Yura (y.sokolov@postgrespro.ru) wrote:
>> I wrote two days ago about the vacuum ring buffer:
>> https://www.postgresql.org/message-id/8737e9bddb82501da1134f021bf4929a%40postgrespro.ru
>>
>> Increasing vacuum's ring buffer to the size of the bulk-write ring reduces
>> autovacuum time by 3-10 times.
>> (For both the patched and unpatched versions I used a single non-default
>> setting, 'autovacuum_cost_delay=2ms'.)
>>
>> This is a single-line change, and it improves things a lot.
>
> Right- when the database fits in the OS cache but not in shared_buffers.

On a system with a slow fsync, increasing the ring buffer helps a lot even
if the database doesn't fit in the OS cache. When the next buffer allocation
runs into a dirtied buffer in the ring, it needs to sync the WAL up through
that buffer's LSN before it can write it out and reuse it. With a small
ring, this means a lot of WAL flushing needs to be done.

> I do agree that's a useful improvement to make based on your testing.
>
> It's not clear off-hand how much that would improve this case, as
> the database size appears to pretty quickly get beyond the OS memory
> size (and only in the first test is the DB starting size less than
> system memory to begin with).

Also, this system probably has a pretty fast fdatasync, considering it is SSD.
 
Cheers,

Jeff

Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Peter Geoghegan
Date:
On Thu, Jul 20, 2017 at 7:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> For the purposes of this discussion, I'm mostly talking about
>> duplicates within a page on a unique index. If the keyspace owned by
>> an int4 unique index page only covers 20 distinct values, it will only
>> ever cover 20 distinct values, now and forever, despite the fact that
>> there is room for about 400 (a 90/10 split leaves you with 366 items +
>> 1 high key).
>
> Microvacuum could also help.
>
> If during a scan you find pointers that point to dead (in vacuum terms)
> tuples, the pointers in the index could be deleted. That could be done
> during insert into unique indexes before a split, to avoid the split.
>
> Chances are, if there are duplicates, at least a few of them will be dead.

My whole point is that that could easily fail to happen early enough
to prevent a pagesplit that is only needed because there is a short
term surge in the number of duplicate versions that need to be
available for one old snapshot. A pagesplit can be a permanent
solution to a temporary problem. Page deletion can only occur under
tight conditions that are unlikely to *ever* be met in many cases.

Imagine if it was impossible to insert physical duplicates into unique
indexes. In that world, you'd end up bloating some overflow data
structure in UPDATE heavy cases (where HOT doesn't work out). The
bloat wouldn't go on leaf pages, and so you wouldn't get page splits,
and so you wouldn't end up with leaf pages that can only store 20
distinct values now and forever, because that's the range of values
represented by downlinks and the leaf's high key. That's a situation
we actually saw for the leftmost leaf page in Alik's Zipfian
distribution test.

The way that the keyspace is broken up is supposed to be balanced, and
to have long term utility. Working against that to absorb a short term
bloat problem is penny wise, pound foolish.

-- 
Peter Geoghegan



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: "Joshua D. Drake"
Date:
On 07/19/2017 07:57 PM, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> My argument for the importance of index bloat to the more general
>> bloat problem is simple: any bloat that accumulates, that cannot be
>> cleaned up, will probably accumulate until it impacts performance
>> quite noticeably.
> 
> But that just begs the question: *does* it accumulate indefinitely, or
> does it eventually reach a more-or-less steady state?  The traditional
> wisdom about btrees, for instance, is that no matter how full you pack
> them to start with, the steady state is going to involve something like
> 1/3rd free space.  You can call that bloat if you want, but it's not
> likely that you'll be able to reduce the number significantly without
> paying exorbitant costs.
> 
> I'm not claiming that we don't have any problems, but I do think it's
> important to draw a distinction between bloat and normal operating
> overhead.

Agreed but we aren't talking about 30% I don't think. Here is where I am 
at. It took until 30 minutes ago for the tests to finish:
                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 20
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
 log_autovacuum_min_duration         | -1


Test 1: 55G    /srv/main
TPS:    955

Test 2: 112G    /srv/main
TPS:    531 (Not sure what happened here, long checkpoint?)

Test 3: 109G    /srv/main
TPS:    868

Test 4: 143G
TPS:    840

Test 5: 154G
TPS:     722

I am running the query here:

https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index

And will post a followup. Once the query finishes I am going to launch 
the tests with autovacuum_vacuum_cost_limit of 5000. Is there anything 
else you folks would like me to change?

JD




-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Mark Kirkwood
Date:
On 21/07/17 15:58, Joshua D. Drake wrote:

> On 07/19/2017 07:57 PM, Tom Lane wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:
>>> My argument for the importance of index bloat to the more general
>>> bloat problem is simple: any bloat that accumulates, that cannot be
>>> cleaned up, will probably accumulate until it impacts performance
>>> quite noticeably.
>>
>> But that just begs the question: *does* it accumulate indefinitely, or
>> does it eventually reach a more-or-less steady state?  The traditional
>> wisdom about btrees, for instance, is that no matter how full you pack
>> them to start with, the steady state is going to involve something like
>> 1/3rd free space.  You can call that bloat if you want, but it's not
>> likely that you'll be able to reduce the number significantly without
>> paying exorbitant costs.
>>
>> I'm not claiming that we don't have any problems, but I do think it's
>> important to draw a distinction between bloat and normal operating
>> overhead.
>
> Agreed but we aren't talking about 30% I don't think. Here is where I 
> am at. It took until 30 minutes ago for the tests to finish:
>
>                 name                 |  setting
> -------------------------------------+-----------
>  autovacuum                          | on
>  autovacuum_analyze_scale_factor     | 0.1
>  autovacuum_analyze_threshold        | 50
>  autovacuum_freeze_max_age           | 200000000
>  autovacuum_max_workers              | 3
>  autovacuum_multixact_freeze_max_age | 400000000
>  autovacuum_naptime                  | 60
>  autovacuum_vacuum_cost_delay        | 20
>  autovacuum_vacuum_cost_limit        | -1
>  autovacuum_vacuum_scale_factor      | 0.2
>  autovacuum_vacuum_threshold         | 50
>  autovacuum_work_mem                 | -1
>  log_autovacuum_min_duration         | -1
>
>
> Test 1: 55G    /srv/main
> TPS:    955
>
> Test 2: 112G    /srv/main
> TPS:    531 (Not sure what happened here, long checkpoint?)
>
> Test 3: 109G    /srv/main
> TPS:    868
>
> Test 4: 143G
> TPS:    840
>
> Test 5: 154G
> TPS:     722
>
> I am running the query here:
>
> https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index 
>
>
> And will post a followup. Once the query finishes I am going to launch 
> the tests with autovacuum_vacuum_cost_limit of 5000. Is there anything 
> else you folks would like me to change?
>
>
>
>

I usually advise setting autovacuum_naptime = 10s (or even 5s) for 
workloads that do a lot of updates (or inserts + deletes) - as on modern 
HW a lot of churn can happen in 1 minute, and that just makes vacuum's 
job harder.
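
A sketch of applying that (autovacuum_naptime only needs a reload):

ALTER SYSTEM SET autovacuum_naptime = '10s';
SELECT pg_reload_conf();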

regards
Mark




Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Sokolov Yura
Date:
On 2017-07-21 06:58, Joshua D. Drake wrote:
> On 07/19/2017 07:57 PM, Tom Lane wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:
>>> My argument for the importance of index bloat to the more general
>>> bloat problem is simple: any bloat that accumulates, that cannot be
>>> cleaned up, will probably accumulate until it impacts performance
>>> quite noticeably.
>> 
>> But that just begs the question: *does* it accumulate indefinitely, or
>> does it eventually reach a more-or-less steady state?  The traditional
>> wisdom about btrees, for instance, is that no matter how full you pack
>> them to start with, the steady state is going to involve something 
>> like
>> 1/3rd free space.  You can call that bloat if you want, but it's not
>> likely that you'll be able to reduce the number significantly without
>> paying exorbitant costs.
>> 
>> I'm not claiming that we don't have any problems, but I do think it's
>> important to draw a distinction between bloat and normal operating
>> overhead.
> 
> Agreed but we aren't talking about 30% I don't think. Here is where I
> am at. It took until 30 minutes ago for the tests to finish:
> 
>                 name                 |  setting
> -------------------------------------+-----------
>  autovacuum                          | on
>  autovacuum_analyze_scale_factor     | 0.1
>  autovacuum_analyze_threshold        | 50
>  autovacuum_freeze_max_age           | 200000000
>  autovacuum_max_workers              | 3
>  autovacuum_multixact_freeze_max_age | 400000000
>  autovacuum_naptime                  | 60
>  autovacuum_vacuum_cost_delay        | 20
>  autovacuum_vacuum_cost_limit        | -1
>  autovacuum_vacuum_scale_factor      | 0.2
>  autovacuum_vacuum_threshold         | 50
>  autovacuum_work_mem                 | -1
>  log_autovacuum_min_duration         | -1
> 
> 
> Test 1: 55G    /srv/main
> TPS:    955
> 
> Test 2: 112G    /srv/main
> TPS:    531 (Not sure what happened here, long checkpoint?)
> 
> Test 3: 109G    /srv/main
> TPS:    868
> 
> Test 4: 143G
> TPS:    840
> 
> Test 5: 154G
> TPS:     722
> 
> I am running the query here:
> 
> https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index
> 
> And will post a followup. Once the query finishes I am going to launch
> the tests with autovacuum_vacuum_cost_limit of 5000. Is there anything
> else you folks would like me to change?
> 
> JD
> 
> 
> 
> 
> --
> Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
> 
> PostgreSQL Centered full stack support, consulting and development.
> Advocate: @amplifypostgres || Learn: https://pgconf.us
> *****     Unless otherwise stated, opinions are my own.   *****

Have you measured an increased vacuum ring buffer?
This will require recompilation, though.

With regards,
-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: "Joshua D. Drake"
Date:
On 07/20/2017 11:54 PM, Sokolov Yura wrote:
> On 2017-07-21 06:58, Joshua D. Drake wrote:
>> On 07/19/2017 07:57 PM, Tom Lane wrote:
>>> Peter Geoghegan <pg@bowt.ie> writes:

>>
>>
>> -- 
>> Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
>>
>> PostgreSQL Centered full stack support, consulting and development.
>> Advocate: @amplifypostgres || Learn: https://pgconf.us
>> *****     Unless otherwise stated, opinions are my own.   *****
> 
> Have you measured an increased vacuum ring buffer?

No, not yet. I think we are still in the proving the problem stage.

JD


> This will require recompilation, though.
> 
> With regards,


-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: "Joshua D. Drake"
Date:
On 07/20/2017 08:58 PM, Joshua D. Drake wrote:
> On 07/19/2017 07:57 PM, Tom Lane wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:

> Test 1: 55G    /srv/main
> TPS:    955
> 
> Test 2: 112G    /srv/main
> TPS:    531 (Not sure what happened here, long checkpoint?)
> 
> Test 3: 109G    /srv/main
> TPS:    868
> 
> Test 4: 143G
> TPS:    840
> 
> Test 5: 154G
> TPS:     722
> 
> I am running the query here:
> 
> https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index 

This query didn't finish after 12 hours. Here is the new set:
                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 12
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 10
 autovacuum_vacuum_cost_delay        | 0
 autovacuum_vacuum_cost_limit        | 5000
 autovacuum_vacuum_scale_factor      | 0.1
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
 log_autovacuum_min_duration         | -1

I have only run one test, but it is pretty telling:

Test 1: 60G     /srv/main
TPS: 914

Test 2: 92G     /srv/main
TPS: Still running

I will post an update after the third or fourth test depending on the 
numbers. I created this instance exactly for these tests so if someone 
wants to poke around I can give access.

Thanks,

JD


-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: "Joshua D. Drake"
Date:
Hello,

I changed the test to run for 6 hours at a time regardless of the number of 
transactions. I also changed the du command to only look at the database 
(previously WAL logs were included). This is the clearest indication of 
the problem I have been able to produce.

Again, this is with 128 clients and 500 warehouses. The first test is a 
clean test, everything dropped, vacuumed etc... Each subsequent test is 
just starting the test again to have breakpoints.


                name                 |  setting
-------------------------------------+-----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 200000000
 autovacuum_max_workers              | 12
 autovacuum_multixact_freeze_max_age | 400000000
 autovacuum_naptime                  | 10
 autovacuum_vacuum_cost_delay        | 0
 autovacuum_vacuum_cost_limit        | 5000
 autovacuum_vacuum_scale_factor      | 0.1
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
 log_autovacuum_min_duration         | -1
 max_wal_size                        | 640
 checkpoint_timeout                  | 86400
 checkpoint_completion_target        | 0.5

Starting base metric
50G     /srv/main/base

Test 1:
90G     /srv/main/base
TPS: 838

Test 2:
121G    /srv/main/base
TPS: 725

Test 3:
146G    /srv/main/base
TPS: 642

Test 4:
171G    /srv/main/base
TPS: 549

Test 5:
189G    /srv/main/base
TPS: 489

Test 6:
208G    /srv/main/base
TPS: 454

As you can see even with aggressive vacuuming, over a period of 36 hours 
life gets increasingly miserable.

The largest table is:

postgres=# select pg_size_pretty(pg_total_relation_size('bmsql_order_line'));
 pg_size_pretty
----------------
 148 GB
(1 row)

postgres=# \d bmsql_order_line
            Table "public.bmsql_order_line"
     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 ol_w_id        | integer                     | not null
 ol_d_id        | integer                     | not null
 ol_o_id        | integer                     | not null
 ol_number      | integer                     | not null
 ol_i_id        | integer                     | not null
 ol_delivery_d  | timestamp without time zone |
 ol_amount      | numeric(6,2)                |
 ol_supply_w_id | integer                     |
 ol_quantity    | integer                     |
 ol_dist_info   | character(24)               |
Indexes:
    "bmsql_order_line_pkey" PRIMARY KEY, btree (ol_w_id, ol_d_id, ol_o_id, ol_number)
Foreign-key constraints:
    "ol_order_fkey" FOREIGN KEY (ol_w_id, ol_d_id, ol_o_id) REFERENCES bmsql_oorder(o_w_id, o_d_id, o_id)
    "ol_stock_fkey" FOREIGN KEY (ol_supply_w_id, ol_i_id) REFERENCES bmsql_stock(s_w_id, s_i_id)

With the PK being

postgres=# select pg_size_pretty(pg_relation_size('bmsql_order_line_pkey'));
 pg_size_pretty
----------------
 48 GB
(1 row)

I tried to see how much data we are dealing with here:

postgres=# select count(*) from bmsql_order_line;
   count
-----------
 910324839
(1 row)

Time: 503965.767 ms

And just to show that we were pushing to get these numbers:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.38    0.00    2.20    1.98    0.00   93.44

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb            2027.40       239.99         0.05       1199          0
sda               0.80         0.00         0.01          0          0



So we have 910M rows, and it took roughly 8.4 minutes to count them at 240 MB/s.

I know this is a lot of data and as I said previously, happy to let 
anyone look at it. However, we clearly have something deeper to look into.

Thanks in advance,

JD


-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: "Joshua D. Drake"
Date:
On 07/23/2017 12:03 PM, Joshua D. Drake wrote:

> As you can see even with aggressive vacuuming, over a period of 36 hours 
> life gets increasingly miserable.
> 
> The largest table is:
> 
> postgres=# select 
> pg_size_pretty(pg_total_relation_size('bmsql_order_line'));
>   pg_size_pretty
> ----------------
>   148 GB
> (1 row)
> 

[snip]

> With the PK being
> 
> postgres=# select 
> pg_size_pretty(pg_relation_size('bmsql_order_line_pkey'));
>   pg_size_pretty
> ----------------
>   48 GB
> (1 row)
> 
> I tried to see how much data we are dealing with here:

-hackers,

I cleaned up the table with VACUUM FULL and ended up with the following:

postgres=# select pg_size_pretty(pg_total_relation_size('bmsql_order_line'));
 pg_size_pretty
----------------
 118 GB
(1 row)

postgres=# select pg_size_pretty(pg_relation_size('bmsql_order_line_pkey'));
 pg_size_pretty
----------------
 27 GB
(1 row)

Does this suggest that we don't have a cleanup problem but a 
fragmentation problem (or both, at least for the index)? Having an index 
that is almost twice its cleaned-up size isn't that uncommon.
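
One way to separate the two effects without another full-table rewrite would
be to rebuild and re-measure just the index. A sketch, keeping in mind that
plain REINDEX blocks writes to the table while it runs:

SELECT pg_size_pretty(pg_relation_size('bmsql_order_line_pkey'));  -- before
REINDEX INDEX bmsql_order_line_pkey;
SELECT pg_size_pretty(pg_relation_size('bmsql_order_line_pkey'));  -- after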

Thanks in advance,

JD


-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
*****     Unless otherwise stated, opinions are my own.   *****



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Peter Geoghegan
Date:
On Mon, Jul 24, 2017 at 10:50 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> Does this suggest that we don't have a cleanup problem but a fragmentation
> problem (or both, at least for the index)? Having an index that is almost
> twice its cleaned-up size isn't that uncommon.

As Tom pointed out up-thread, it's important to distinguish between
inherent overhead, and overhead due to garbage that needs to be
cleaned-up by vacuum. It's really hard to delineate which is which
here, and I'm not going to try to put a number on it. What I will
point out is that you can see quite a significant difference between
the space utilization of a B-Tree without any dead tuples, just from
the order in which tuples are initially inserted.

You can get about a 1/3 loss of space by inserting randomly, rather
than inserting in sorted order, which is what REINDEX will more or
less do for you. That's because random workloads almost entirely get
50:50 page splits, whereas sorted input will always split the
rightmost page, and so will always get 90:10 splits. The space in the
random case isn't exactly wasted; it's there for the taking, for key
values that happen to fit on the page. You effectively require a
larger average reserve of free space on pages with the random
workload, because the implementation does not and cannot reason that
it would be best to concentrate free space in parts of the keyspace
where there is most need for it.

That having been said, I do think that this workload suffers from
index bloat in a way that isn't so easily explained. It does seem to
be an issue with VACUUM controlling bloat in the index in particular.

-- 
Peter Geoghegan



Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

From: Peter Geoghegan
Date:
On Mon, Jul 24, 2017 at 11:12 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> You can get about a 1/3 loss of space by inserting randomly, rather
> than inserting in sorted order, which is what REINDEX will more or
> less do for you. That's because random workloads almost entirely get
> 50:50 page splits, whereas sorted input will always split the
> rightmost page, and so will always get 90:10 splits. The space in the
> random case isn't exactly wasted; it's there for the taking, for key
> values that happen to fit on the page.

I looked into this again. I decided to see for myself what was up.
Actually, I used oltpbench [1], since that was the easiest TPC-C
benchmark available.

The order_line primary key may have been designed to be as unfriendly
as possible to implementations that don't have good suffix truncation.
This is also true of several other TPC-C indexes.

I initialized a scale 50 TPC-C database:

pg@bat:~/code/oltpbench$ ./oltpbenchmark -b tpcc -c
config/my_postgres_tpcc_config.xml --create=true --load=true
15:26:57,214 (DBWorkload.java:259) INFO  -
======================================================================

Benchmark:     TPCC {com.oltpbenchmark.benchmarks.tpcc.TPCCBenchmark}
Configuration: config/sample_tpcc_config.xml
Type:          POSTGRES
Driver:        org.postgresql.Driver
URL:           jdbc:postgresql:tpcc
Isolation:     TRANSACTION_SERIALIZABLE
Scale Factor:  50.0

Note that there is no garbage for VACUUM to clean up just yet. There
haven't been any deletes or updates yet. And yet, order_line_pkey is
bloated. It's 783 MB, but if I run REINDEX it shrinks right down to
451 MB. This is on the master branch -- not with one of my patches.
You can also arrange to make the index much smaller if you force the
insertions to occur in a totally random order, rather than the order
that the benchmark actually inserts them.

The problem for us is that tuples are inserted in
clustered/monotonically increasing order for the last 3 columns, while
the first column (ol_w_id) is a low cardinality column whose values
appear in random order, more or less. This isn't actually unrealistic
- it makes sense that associated records would be inserted as
per-warehouse groups like this, while the order of the groups remains
unpredictable.

I am working on adding suffix truncation at the moment. Right now, my
patch doesn't help, because it isn't sophisticated enough about the
choice of split point (we need to care about suffix truncation, and
not just balancing space among the two halves of a split). It should
try to find a split point that maximizes the chances of a new pivot
tuple not having any low cardinality final column (ol_number).

We should be able to mostly avoid having the last column appear in tuples
that make it into the branch/internal nodes:

pg@tpcc[1452]=# select count(*) from order_line ;
   count
────────────
 15,001,784
(1 row)

pg@tpcc[1452]=# select count(distinct(ol_w_id, ol_d_id, ol_o_id)) from
order_line ;
   count
───────────
 1,500,000
(1 row)

As I said, the final ol_number column makes the keyspace unbalanced by
unnecessarily appearing in the internal/branch nodes:

pg@tpcc[1452]=# select count(distinct(ol_number)) from order_line ;
 count
───────
    15
(1 row)

Since there can only be 15 distinct values for any given (ol_w_id,
ol_d_id, ol_o_id, *), and since over 100 index tuples will fit on each
leaf page, we just have to pick a split point that's between (x, y, z,
15) and (x, y, z + 1, 1). This makes it legal to truncate away the
ol_number column, which allows us to balance the use of free space for
items that are inserted after the split, by later transactions.

[1] https://github.com/oltpbenchmark/oltpbench
--
Peter Geoghegan