Thread: [PERFORM] Optimizing around retained tuples

[PERFORM] Optimizing around retained tuples

From: James Parks
Date:
Hi pgsql-performance!

So I have a PostgreSQL database -- version "PostgreSQL 9.4.8 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit", specifically.

In it, I have essentially two categories of tables:
- small tables that are updated frequently, and tend to be often queried in their entirety (seq scan)
- large tables that are updated infrequently, and tend to be often queried using an index

Let us assume that I have a table "A" that falls in the small category, and a table "B" that falls in the large category.

The problem I'm having is that it is very difficult to do any amount of maintenance on tables like B without effectively DoSing queries that reference table A. The reason, as far as I can tell, is that any long-running statement against a table like B causes every superseded row version in table A to be retained until the statement on B completes (as per the READ COMMITTED transaction isolation level -- a statement against B must only see rows committed before it started, so VACUUM cannot reclaim anything newer). This makes sense -- it's required to keep ACID.
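
You can actually watch this happen from pg_stat_activity -- something like the
following (a rough sketch against 9.4's pg_stat_activity; adjust as needed)
shows which backend is holding the oldest snapshot and therefore keeping
VACUUM from reclaiming anything newer:

-- backends ordered by how far back their snapshot reaches
SELECT pid, state, xact_start, age(backend_xmin) AS xmin_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;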

However, there are times when I need to run large operations on B -- these operations can be almost anything, but I'll focus on my most recent need: running "pg_dump" against table B.

I should add that table B is never involved in any query that touches table A -- in this case, it is an append-only table that records changes to a table C, which is likewise never involved with table A.

So, on to the data on which I base the above claims:

Let table A have 43 thousand rows:
database=> select count(*) from a;
-[ RECORD 1 ]
count | 43717
Time: 10447.681 ms

Let table B have 21 million rows:
meraki_shard_production=> select count(id) from b;
-[ RECORD 1 ]---
count | 21845610
Time: 116873.051 ms

Assume a pg_dump operation is copying table B, i.e. there's a currently running query that looks like "COPY public.b (id, ...) TO STDOUT"

Then this is what I get for running a verbose vacuum against A:

database=> vacuum verbose a;
INFO:  vacuuming "public.a"
INFO:  index "a_pkey" now contains 2119583 row versions in 9424 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.04s/0.03u sec elapsed 0.49 sec.
INFO:  "a": found 0 removable, 2112776 nonremovable row versions in 185345 out of 186312 pages
DETAIL:  2069676 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 1.28s/1.15u sec elapsed 22.93 sec.
INFO:  vacuuming "pg_toast.pg_toast_18889"
INFO:  index "pg_toast_18889_index" now contains 31 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_18889": found 0 removable, 31 nonremovable row versions in 7 out of 7 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 23035.282 ms

... and here's how long it takes to read all of the rows:
database=> select max(an unindexed bigint column) from a;
-[ RECORD 1 ]--------
max | <some number>
Time: 10624.368 ms

Running this another time immediately afterward (to show the cached speed) returns:
Time: 13782.363 ms

If I go to a separate database cluster that has an equivalent schema and a roughly equivalent table a (within ±2% on the number of rows), the above queries look more like this:

meraki_shard_production=> vacuum verbose a;
INFO:  vacuuming "public.a"
INFO:  index "a_pkey" now contains 42171 row versions in 162 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "a": found 487 removable, 42286 nonremovable row versions in 7809 out of 7853 pages
DETAIL:  373 dead row versions cannot be removed yet.
There were 42436 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.02u sec elapsed 0.01 sec.
INFO:  vacuuming "pg_toast.pg_toast_19037"
INFO:  index "pg_toast_19037_index" now contains 57 row versions in 2 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_19037": found 0 removable, 57 nonremovable row versions in 12 out of 12 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 32.890 ms

database=> select max(the same unindexed bigint column) from a;
       max      
-----------------
 <some number>
(1 row)
Time: 16.696 ms
(The second iteration takes 15.320 ms)

So, the way I see it, my problem boils down to table "A" getting roughly 100-1000x slower when it gets roughly 20-50x bigger (depending on whether you measure in pages or tuples). Unfortunately, in my use case, table "A" acts as a join table for a lot of aspects of our company's webapp. Every 10 minutes, the table is queried for 35 million rows via sequential scan (~800 seq scans every 10 minutes, or ~1.3 per second on average), and 6.5 million rows via index lookup. When a sequential scan over 40k rows takes less than 1 second, everything is fine -- when it takes 10+ seconds the database starts to slow down significantly. Thankfully, queries can share sequential scans, but you can imagine how the responsiveness of the webapp might suffer as a consequence.

There's also a secondary effect: by the time the query on B completes, there are many in-flight queries against A (and other related tables) that are now slow enough to potentially bloat A even further. After some maintenance query against B has completed, it is not uncommon for queries involving A, which usually complete in roughly 300 ms, to start taking upwards of 30 minutes.

Our go-to solution has been to detect and stop these maintenance queries if they take too long, and then to CLUSTER table A. This puts a cap on how long any maintenance query can take -- down to somewhere around 1 hour.
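
Concretely, that workaround amounts to something like this (a sketch only --
the one-hour cutoff and the COPY pattern are illustrative, not what we
literally run):

-- stop any maintenance statement that has been running for over an hour
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 hour'
  AND query LIKE 'COPY public.b%';

-- then rewrite table A to discard the accumulated dead row versions
-- (note: CLUSTER takes an ACCESS EXCLUSIVE lock on a while it runs)
CLUSTER a USING a_pkey;
ANALYZE a;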

And thus my query to you guys:

What can I do to keep running long maintenance operations on large tables (SELECTing significant fractions of B, DELETEing significant fractions of B, running VACUUM FULL on B) without denying other PostgreSQL backends their ability to efficiently query table A? Or, in other words, how do I avoid incurring the cost of transaction isolation for queries against B on a case-by-case basis?

Anything is on the table for implementation:
- moving tables to a different database / cluster / completely different DBMS system
- designing an extension to tune either set of queries
- partitioning tables
- etc
... although the simpler the better. If you were in this position, what would you do?

Regards,
James

Re: [PERFORM] Optimizing around retained tuples

From: Claudio Freire
Date:
On Tue, Mar 21, 2017 at 4:24 PM, James Parks <james.parks@meraki.net> wrote:
> [...]

You're experiencing bloat because the transaction on B is preventing
the xid horizon from moving forward, thus dead tuples from A cannot be
reclaimed in case the transaction on B decides to query them.

There's only one "easy" solution for this as far as I know, and it is
to run your long-running queries on a hot standby. That certainly
works for most read-only workloads, especially pg_dump.
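
The dump or report queries then simply connect to the replica instead of the
primary. A quick sanity check that a session really is on the standby:

-- returns true on a hot standby, false on the primary
SELECT pg_is_in_recovery();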


Re: [PERFORM] Optimizing around retained tuples

From: Claudio Freire
Date:
On Tue, Mar 21, 2017 at 10:56 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 4:24 PM, James Parks <james.parks@meraki.net> wrote:
>> [...]
>
> You're experiencing bloat because the transaction on B is preventing
> the xid horizon from moving forward, thus dead tuples from A cannot be
> reclaimed in case the transaction on B decides to query them.
>
> There's only one "easy" solution for this as far as I know, and it is
> to run your long-running queries on a hot standby. That certainly
> works for most read-only workloads, especially pg_dump.

Forgot to clarify... for your use case, make sure you *don't* enable
hot_standby_feedback on the standby.
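
As a sketch, on the standby that looks like:

-- off (the default) means the standby's long-running queries cannot hold back
-- the primary's xmin horizon; they may instead be cancelled if they conflict
-- with replayed cleanup
SHOW hot_standby_feedback;

ALTER SYSTEM SET hot_standby_feedback = off;
SELECT pg_reload_conf();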


Re: Optimizing around retained tuples

From: Brad DeJong
Date:
On Tue, Mar 21, 2017 at 4:24 PM, James Parks <james.parks@meraki.net> wrote:
> What can I do to keep running long maintenance operations on large
> tables (SELECTing significant fractions of B, DELETEing significant
> fractions of B, running VACUUM FULL on B) without denying other
> Postgresql backends their ability to efficiently query table A? 
> 
> Anything is on the table for implementation:
>  - moving tables to a different database / cluster / completely different DBMS system
>  - designing an extension to tune either sets of queries
>  - partitioning tables
>  - etc

The PostgreSQL 9.6 old_snapshot_threshold feature may be useful for this situation.

From the patch proposal e-mail: "... Basically, this patch aims to limit bloat when there are snapshots
that are kept registered for prolonged periods ...".

I think that matches your description.

PgCon 2016 presentation - https://www.pgcon.org/2016/schedule/attachments/420_snapshot-too-old.odp
CommitFest entry - https://commitfest.postgresql.org/9/562/

On Tue, Mar 21, 2017 at 10:56 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> You're experiencing bloat because the transaction on B is preventing 
> the xid horizon from moving forward, thus dead tuples from A cannot be 
> reclaimed in case the transaction on B decides to query them.

Setting old_snapshot_threshold to a positive value changes that behavior.

Instead of holding on to the "dead" tuples in A so that the transaction
on B can query them in the future, the tuples are vacuumed away and the
transaction on B gets a "snapshot too old" error if it tries to read a
page in A where a tuple was vacuumed.
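
A minimal sketch of enabling it on 9.6 (the one-hour value here is arbitrary):

-- let VACUUM reclaim dead tuples once a snapshot is more than an hour old;
-- readers holding an older snapshot then risk "snapshot too old" on pages
-- that have since been cleaned up
ALTER SYSTEM SET old_snapshot_threshold = '1h';
-- postmaster-level setting, so a server restart is required for it to take effect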

There are also discussions on pgsql-hackers ("pluggable storage" and "UNDO
and in-place update") regarding alternate table formats that might work
better in this situation. But it doesn't look like either of those will
make it into PostgreSQL 10.