Discussion: Huge shared hit for small table
Hello all,
We are trying to debug some slow performance in our production environment (Amazon RDS, PostgreSQL 9.6.11), and we’re looking at a particular EXPLAIN node that seems… weird. This is a very large query involving a number of joins, but it performs pretty well in our staging environment (which has roughly the same data set as production, with a few tweaks). However, there is one node in the EXPLAIN plan that is wildly different:
In the staging environment, we get this:
Index Scan using "programPK" on public.program prog (cost=0.29..0.35 rows=1 width=16) (actual time=0.002..0.003 rows=1 loops=21965)
Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
Index Cond: (prog.id = per.program)
Buffers: shared hit=87860
In the production environment, we get this:
Index Scan using "programPK" on public.program prog (cost=0.29..0.36 rows=1 width=16) (actual time=0.017..4.251 rows=1 loops=21956)
Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
Index Cond: (prog.id = per.program)
Buffers: shared hit=25437716
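For reference, per-node Output and Buffers lines like the ones above are produced by EXPLAIN with the ANALYZE, BUFFERS, and VERBOSE options. A minimal sketch follows; the real query involves many more joins, and the person table behind the per alias is an assumption, since the thread never names it:

-- Sketch only: ANALYZE runs the query, BUFFERS adds shared hit/read
-- counts per node, VERBOSE adds the Output: column lists.
-- "person" is a hypothetical stand-in for the table behind the per alias.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT prog.id, prog.name
FROM person per
JOIN program prog ON prog.id = per.program;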
The tables in both environments are about the same size (18 MB) and the indexes are about the same size (360 kB/410 kB) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.
Lastly, if we take out the join to the “program” table, the query performs much faster in production and the timing between staging and production is similar.
This has happened once before, and we did a “REINDEX” on the program table – and that made the problem mostly go away. Now it seems to be back, and I’m not sure what to make of it.
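That REINDEX would have been along these lines (a sketch; note that on 9.6 a plain REINDEX blocks writes to the table and blocks reads that try to use the index being rebuilt):

-- Rebuild every index on the table from scratch.
REINDEX TABLE program;

-- Or rebuild just the one index:
-- REINDEX INDEX "programPK";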
Thanks in advance for any help you can offer!
Scott
SCOTT RANKIN
VP, Technology
Motus, LLC
Two Financial Center, 60 South Street, Boston, MA 02111
617.467.1900 (O) | srankin@motus.com
Hi,

On 2019-11-04 19:38:40 +0000, Scott Rankin wrote:
> In the staging environment, we get this:
> Buffers: shared hit=87860
>
> In the production environment, we get this:
> Buffers: shared hit=25437716
>
> The tables in both environments are about the same size (18MB) and the indexes are about the same size (360kb/410kb) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.

It'd be worthwhile to look at the index stats using pgstatindex. Also, could you show the definition of those indexes?

Greetings,

Andres Freund
On Mon, Nov 04, 2019 at 07:38:40PM +0000, Scott Rankin wrote:
> In the staging environment, we get this:
> Buffers: shared hit=87860
>
> In the production environment, we get this:
> Buffers: shared hit=25437716

I think it's because some heap pages are being visited many times, due to the index tuples being badly "fragmented". Note, I'm not talking about fragmentation of index *pages*, which is what pgstattuple reports (and which wouldn't have nearly so detrimental an effect). I could probably say that the index tuples are badly "correlated" with the heap.

I'm guessing there are perhaps 25437716/87860 = 290 index tuples per page, and they rarely point to the same heap page as their siblings. "Hit" means that this affects you even though the data is cached (by postgres). So this is apparently slow due to reading each page ~300 times to pick up its tuples piecemeal, rather than once to get them all at once.

> This has happened one time before, and we did a "REINDEX" on the program table – and that made the problem mostly go away.

..which is consistent with my hypothesis. You can use pg_repack or the CREATE INDEX + DROP + RENAME hack (which is what pg_repack -i does). In a fresh index, the tuples are sorted by heap TID. You could also CLUSTER the table itself (or pg_repack -t) on that index column.

In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash affecting its progress reporting; the fix is to be included in v12.1).

Justin
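The CREATE INDEX + DROP + RENAME hack Justin mentions would look roughly like this; a sketch only, assuming "programPK" is a plain unique index rather than one backing a PRIMARY KEY or UNIQUE constraint (a constraint-backed index would instead need ALTER TABLE ... ADD CONSTRAINT ... USING INDEX):

-- Build a fresh, compact copy of the index without blocking writes.
CREATE UNIQUE INDEX CONCURRENTLY "programPK_new" ON program (id int8_ops);

-- Swap it in for the old, bloated index.
DROP INDEX CONCURRENTLY "programPK";
ALTER INDEX "programPK_new" RENAME TO "programPK";

-- On PostgreSQL 12 and later, one command does the same rebuild in place:
-- REINDEX INDEX CONCURRENTLY "programPK";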
The index is exceedingly simple:
CREATE UNIQUE INDEX "programPK" ON program(id int8_ops);
From pg_stat_user_indexes:
Staging:
idx_scan: 5826745
idx_tup_read: 52715470
idx_tup_fetch: 52644465
Production:
idx_scan: 7277919087
idx_tup_read: 90612605047
idx_tup_fetch: 5207807880
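Those counters come from the cumulative statistics view; a query along these lines, with the public schema assumed:

-- Per-index usage counters since the last stats reset.
SELECT idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
  AND indexrelname = 'programPK';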
From: Andres Freund <andres@anarazel.de>
Date: Monday, November 4, 2019 at 2:46 PM
To: Scott Rankin <srankin@motus.com>
Cc: "pgsql-performance@lists.postgresql.org" <pgsql-performance@lists.postgresql.org>
Subject: Re: Huge shared hit for small table
Hi,
On 2019-11-04 19:38:40 +0000, Scott Rankin wrote:
> In the staging environment, we get this:
>
> Index Scan using "programPK" on public.program prog (cost=0.29..0.35 rows=1 width=16) (actual time=0.002..0.003 rows=1 loops=21965)
> Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
> Index Cond: (prog.id = per.program)
> Buffers: shared hit=87860
>
> In the production environment, we get this:
>
> Index Scan using "programPK" on public.program prog (cost=0.29..0.36 rows=1 width=16) (actual time=0.017..4.251 rows=1 loops=21956)
> Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
> Index Cond: (prog.id = per.program)
> Buffers: shared hit=25437716
>
> The tables in both environments are about the same size (18MB) and the indexes are about the same size (360kb/410kb) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.
It'd be worthwhile to look at the index stats using pgstatindex. Also,
could you show the definition of those indexes?
Greetings,
Andres Freund
On Mon, Nov 4, 2019 at 11:56 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I think it's because some heap pages are being visited many times, due to the index tuples being badly "fragmented". Note, I'm not talking about fragmentation of index *pages*, which is what pgstattuple reports (which wouldn't have nearly so detrimental an effect). I could probably say that the index tuples are badly "correlated" with the heap.

But this is a unique index, and Scott indicates that the problem seems to go away for a while following a REINDEX.

> In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash affecting its progress reporting; the fix is to be included in v12.1).

PG v12 will store B-Tree duplicates in heap TID order, so if that's the problem then upgrading to v12 (and REINDEXing if the upgrade was performed using pg_upgrade) will fix it for good.

--
Peter Geoghegan
Hi,

On 2019-11-04 19:56:57 +0000, Scott Rankin wrote:
> The index is exceedingly simple:
>
> CREATE UNIQUE INDEX "programPK" ON program(id int8_ops);
>
> From pg_stat_user_indexes: ...

I was basically asking for

SELECT * FROM pgstatindex('"programPK"');

with pgstatindex being from the pgstattuple extension, https://www.postgresql.org/docs/current/pgstattuple.html - not the pg_stat_user_indexes entry.

Greetings,

Andres Freund
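For completeness, pgstattuple is a contrib extension that must be installed once per database, and the mixed-case index name needs the inner double quotes shown above so the identifier is not folded to lower case; a sketch:

-- pgstatindex() ships in the pgstattuple contrib extension.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Inner double quotes keep "programPK" from being folded to programpk.
SELECT * FROM pgstatindex('"programPK"');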
Thanks to Justin for the clarification around pgstatindex:

Staging:

version: 2
tree_level: 1
index_size: 425984
root_block_no: 3
internal_pages: 1
leaf_pages: 50
empty_pages: 0
deleted_pages: 0
avg_leaf_density: 70.86
leaf_fragmentation: 16

Production:

version: 2
tree_level: 1
index_size: 360448
root_block_no: 3
internal_pages: 1
leaf_pages: 41
empty_pages: 0
deleted_pages: 1
avg_leaf_density: 60.44
leaf_fragmentation: 39.02

On 11/4/19, 3:07 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:

> But this is a unique index, and Scott indicates that the problem seems to go away for a while following a REINDEX.
>
> PG v12 will store B-Tree duplicates in heap TID order, so if that's the problem then upgrading to v12 (and REINDEXing if the upgrade was performed using pg_upgrade) will fix it for good.
On Mon, Nov 4, 2019 at 12:32 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> Could there be a long-open transaction, which is preventing hint-bits from getting set on the table rows, as well as on the index rows? Contention on a small number of rows may also be a factor.
>
> A reindex would not by itself fix the problem if it were the long open transaction. But if the long open transaction held a sufficient lock on the table, then the reindex would block until the transaction went away on its own, at which point the problem would go away on its own, so it might *appear* to have fixed the problem.

That seems like the simplest and most likely explanation to me, even though it isn't particularly simple.

--
Peter Geoghegan
Definitely no long-running transactions on this table; in fact, this table is pretty infrequently updated – on the order of a few tens of rows updated per day.
From: Jeff Janes <jeff.janes@gmail.com>
Date: Monday, November 4, 2019 at 3:32 PM
To: Scott Rankin <srankin@motus.com>
Cc: "pgsql-performance@lists.postgresql.org" <pgsql-performance@lists.postgresql.org>
Subject: Re: Huge shared hit for small table
On Mon, Nov 4, 2019 at 2:38 PM Scott Rankin <srankin@motus.com> wrote:
Hello all,
We are trying to debug some slow performance in our production environment (Amazon RDS, Postgresql 9.6.11), and we’re looking at a particular EXPLAIN node that seems… weird. This is a very large query involving a number of joins, but it performs pretty well in our staging environment (which has roughly the same data set as production, with a few tweaks). However, there is one node in the EXPLAIN plan that is wildly different:
Could there be a long-open transaction, which is preventing hint-bits from getting set on the table rows, as well as on the index rows?
...
The tables in both environments are about the same size (18MB) and the indexes are about the same size (360kb/410kb) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.
If this table has more turn-over than those other tables (as measured in rows, not in percentage of the table), this would not be inconsistent with my theory.
This has happened one time before, and we did a “REINDEX” on the program table – and that made the problem mostly go away. Now it seems to be back, and I’m not sure what to make of it.
A reindex would not by itself fix the problem if it were the long open transaction. But if the long open transaction held a sufficient lock on the table, then the reindex would block until the transaction went away on its own, at which point the problem would go away on its own, so it might **appear** to have fixed the problem.
Cheers,
Jeff
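A session like the one Jeff describes can be spotted from pg_stat_activity; a sketch that lists backends holding a transaction open, oldest first (column names as of 9.6):

-- Backends with an open transaction, oldest first.
SELECT pid, usename, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;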
On Mon, Nov 4, 2019 at 12:38 PM Scott Rankin <srankin@motus.com> wrote:
> Definitely no long-running transactions on this table; in fact, this table is pretty infrequently updated – on the order of a few tens of rows updated per day.

But a long running transaction will have an impact on all tables -- not just the tables that happen to have been accessed so far in the long running transaction. This is necessary because nothing stops the long running transaction from SELECTing data from any table at any time -- we need to pessimistically keep around the data required to make that work.

--
Peter Geoghegan
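The horizon Peter describes is visible as backend_xmin in pg_stat_activity; a sketch for seeing which session is holding back cleanup for every table:

-- The oldest backend_xmin pins dead-row cleanup database-wide.
SELECT pid, usename, state, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;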
I think we have a winner. I looked in and found a process that was 'idle in transaction' for a couple of days – and once I killed it, query performance went back to normal. Thank you all for the very quick responses on this.

On 11/4/19, 3:41 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:

> But a long running transaction will have an impact on all tables -- not just the tables that happen to have been accessed so far in the long running transaction. This is necessary because nothing stops the long running transaction from SELECTing data from any table at any time -- we need to pessimistically keep around the data required to make that work.
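Terminating the offending backend would have looked something like this; the pid is a placeholder taken from pg_stat_activity, and terminating other users' sessions requires superuser (on RDS, the rds_superuser role):

-- 12345 is a placeholder pid from pg_stat_activity.
SELECT pg_terminate_backend(12345);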