Discussion: Huge shared hit for small table
Hello all,
We are trying to debug some slow performance in our production environment (Amazon RDS, PostgreSQL 9.6.11), and we’re looking at a particular EXPLAIN node that seems… weird. This is a very large query involving a number of joins, but it performs pretty well in our staging environment (which has roughly the same data set as production, with a few tweaks). However, there is one node in the EXPLAIN plan that is wildly different:
In the staging environment, we get this:
Index Scan using "programPK" on public.program prog (cost=0.29..0.35 rows=1 width=16) (actual time=0.002..0.003 rows=1 loops=21965)
Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
Index Cond: (prog.id = per.program)
Buffers: shared hit=87860
In the production environment, we get this:
Index Scan using "programPK" on public.program prog (cost=0.29..0.36 rows=1 width=16) (actual time=0.017..4.251 rows=1 loops=21956)
Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
Index Cond: (prog.id = per.program)
Buffers: shared hit=25437716
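For reference, per-node Output and Buffers lines like the ones above are produced by EXPLAIN with the ANALYZE, BUFFERS, and VERBOSE options. A minimal sketch follows; the real query involves many more joins, and the person table behind the per alias is an assumption, since the thread never names it:

-- Sketch only: ANALYZE runs the query, BUFFERS adds shared hit/read
-- counts per node, VERBOSE adds the Output: column lists.
-- "person" is a hypothetical stand-in for the table behind the per alias.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT prog.id, prog.name
FROM person per
JOIN program prog ON prog.id = per.program;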
The tables in both environments are about the same size (18 MB) and the indexes are about the same size (360 kB/410 kB) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.
Lastly, if we take out the join to the “program” table, the query performs much faster in production and the timing between staging and production is similar.
This has happened once before, and we did a “REINDEX” on the program table – and that made the problem mostly go away. Now it seems to be back, and I’m not sure what to make of it.
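That REINDEX would have been along these lines (a sketch; note that on 9.6 a plain REINDEX blocks writes to the table and blocks reads that try to use the index being rebuilt):

-- Rebuild every index on the table from scratch.
REINDEX TABLE program;

-- Or rebuild just the one index:
-- REINDEX INDEX "programPK";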
Thanks in advance for any help you can offer!
Scott
SCOTT RANKIN
VP, Technology
Motus, LLC
Two Financial Center, 60 South Street, Boston, MA 02111
617.467.1900 (O) | srankin@motus.com
Hi,

On 2019-11-04 19:38:40 +0000, Scott Rankin wrote:
> In the staging environment, we get this:
> Buffers: shared hit=87860
>
> In the production environment, we get this:
> Buffers: shared hit=25437716
>
> The tables in both environments are about the same size (18MB) and the indexes are about the same size (360kb/410kb) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.

It'd be worthwhile to look at the index stats using pgstatindex. Also, could you show the definition of those indexes?

Greetings,

Andres Freund
On Mon, Nov 04, 2019 at 07:38:40PM +0000, Scott Rankin wrote:
> In the staging environment, we get this:
> Buffers: shared hit=87860
>
> In the production environment, we get this:
> Buffers: shared hit=25437716

I think it's because some heap pages are being visited many times, due to the index tuples being badly "fragmented". Note, I'm not talking about fragmentation of index *pages*, which is what pgstattuple reports (and which wouldn't have nearly so detrimental an effect). I could probably say that the index tuples are badly "correlated" with the heap.

I'm guessing there are perhaps 25437716/87860 = 290 index tuples per page, and they rarely point to the same heap page as their siblings. "Hit" means that this affects you even though the data is cached (by postgres). So this is apparently slow due to reading each page ~300 times to pick up its tuples piecemeal, rather than once to get them all at once.

> This has happened one time before, and we did a "REINDEX" on the program table – and that made the problem mostly go away.

..which is consistent with my hypothesis. You can use pg_repack or the CREATE INDEX + DROP + RENAME hack (which is what pg_repack -i does). In a fresh index, the tuples are sorted by heap TID. You could also CLUSTER the table itself (or pg_repack -t) on that index column.

In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash affecting its progress reporting; the fix is to be included in v12.1).

Justin
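The CREATE INDEX + DROP + RENAME hack Justin mentions would look roughly like this; a sketch only, assuming "programPK" is a plain unique index rather than one backing a PRIMARY KEY or UNIQUE constraint (a constraint-backed index would instead need ALTER TABLE ... ADD CONSTRAINT ... USING INDEX):

-- Build a fresh, compact copy of the index without blocking writes.
CREATE UNIQUE INDEX CONCURRENTLY "programPK_new" ON program (id int8_ops);

-- Swap it in for the old, bloated index.
DROP INDEX CONCURRENTLY "programPK";
ALTER INDEX "programPK_new" RENAME TO "programPK";

-- On PostgreSQL 12 and later, one command does the same rebuild in place:
-- REINDEX INDEX CONCURRENTLY "programPK";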
The index is exceedingly simple:
CREATE UNIQUE INDEX "programPK" ON program(id int8_ops);
From pg_stat_user_indexes:
Staging:
idx_scan: 5826745
idx_tup_read: 52715470
idx_tup_fetch: 52644465
Production:
idx_scan: 7277919087
idx_tup_read: 90612605047
idx_tup_fetch: 5207807880
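Those counters come from the cumulative statistics view; a query along these lines, with the public schema assumed:

-- Per-index usage counters since the last stats reset.
SELECT idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
  AND indexrelname = 'programPK';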
From: Andres Freund <andres@anarazel.de>
Date: Monday, November 4, 2019 at 2:46 PM
To: Scott Rankin <srankin@motus.com>
Cc: "pgsql-performance@lists.postgresql.org" <pgsql-performance@lists.postgresql.org>
Subject: Re: Huge shared hit for small table
Hi,
On 2019-11-04 19:38:40 +0000, Scott Rankin wrote:
> In the staging environment, we get this:
>
> Index Scan using "programPK" on public.program prog (cost=0.29..0.35 rows=1 width=16) (actual time=0.002..0.003 rows=1 loops=21965)
> Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
> Index Cond: (prog.id = per.program)
> Buffers: shared hit=87860
>
> In the production environment, we get this:
>
> Index Scan using "programPK" on public.program prog (cost=0.29..0.36 rows=1 width=16) (actual time=0.017..4.251 rows=1 loops=21956)
> Output: prog.id, prog.version, prog.active, prog.created_date, prog.last_modified_date, prog.created_by, prog.last_modified_by, prog.client_id, prog.scheme_id, prog.name, prog.legacy_group_id, prog.custom_fields, prog.setup_complete, prog.setup_messages, prog.legacy_program_type
> Index Cond: (prog.id = per.program)
> Buffers: shared hit=25437716
>
> The tables in both environments are about the same size (18MB) and the indexes are about the same size (360kb/410kb) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.
It'd be worthwhile to look at the index stats using pgstatindex. Also,
could you show the definition of those indexes?
Greetings,
Andres Freund
On Mon, Nov 4, 2019 at 11:56 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I think it's because some heap pages are being visited many times, due to the index tuples being badly "fragmented". Note, I'm not talking about fragmentation of index *pages*, which is what pgstattuple reports (which wouldn't have nearly so detrimental an effect). I could probably say that the index tuples are badly "correlated" with the heap.

But this is a unique index, and Scott indicates that the problem seems to go away for a while following a REINDEX.

> In PG v12 you can use REINDEX CONCURRENTLY (but beware there's a crash affecting its progress reporting; the fix is to be included in v12.1).

PG v12 will store B-Tree duplicates in heap TID order, so if that's the problem then upgrading to v12 (and REINDEXing if the upgrade was performed using pg_upgrade) will fix it for good.

--
Peter Geoghegan
Hi,

On 2019-11-04 19:56:57 +0000, Scott Rankin wrote:
> The index is exceedingly simple:
>
> CREATE UNIQUE INDEX "programPK" ON program(id int8_ops);
>
> From pg_stat_user_indexes: ...

I was basically asking for

SELECT * FROM pgstatindex('"programPK"');

with pgstatindex being from the pgstattuple extension, https://www.postgresql.org/docs/current/pgstattuple.html - not the pg_stat_user_indexes entry.

Greetings,

Andres Freund
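For completeness, pgstattuple is a contrib extension that must be installed once per database, and the mixed-case index name needs the inner double quotes shown above so the identifier is not folded to lower case; a sketch:

-- pgstatindex() ships in the pgstattuple contrib extension.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Inner double quotes keep "programPK" from being folded to programpk.
SELECT * FROM pgstatindex('"programPK"');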
Thanks to Justin for the clarification around pgstatindex:

Staging:

version: 2
tree_level: 1
index_size: 425984
root_block_no: 3
internal_pages: 1
leaf_pages: 50
empty_pages: 0
deleted_pages: 0
avg_leaf_density: 70.86
leaf_fragmentation: 16

Production:

version: 2
tree_level: 1
index_size: 360448
root_block_no: 3
internal_pages: 1
leaf_pages: 41
empty_pages: 0
deleted_pages: 1
avg_leaf_density: 60.44
leaf_fragmentation: 39.02

On 11/4/19, 3:07 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:

> But this is a unique index, and Scott indicates that the problem seems to go away for a while following a REINDEX.
>
> PG v12 will store B-Tree duplicates in heap TID order, so if that's the problem then upgrading to v12 (and REINDEXing if the upgrade was performed using pg_upgrade) will fix it for good.
On Mon, Nov 4, 2019 at 12:32 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> Could there be a long-open transaction, which is preventing hint-bits from getting set on the table rows, as well as on the index rows? Contention on a small number of rows may also be a factor.
>
> A reindex would not by itself fix the problem if it were the long open transaction. But if the long open transaction held a sufficient lock on the table, then the reindex would block until the transaction went away on its own, at which point the problem would go away on its own, so it might *appear* to have fixed the problem.

That seems like the simplest and most likely explanation to me, even though it isn't particularly simple.

--
Peter Geoghegan
Definitely no long-running transactions on this table; in fact, this table is pretty infrequently updated – on the order of a few tens of rows updated per day.
From: Jeff Janes <jeff.janes@gmail.com>
Date: Monday, November 4, 2019 at 3:32 PM
To: Scott Rankin <srankin@motus.com>
Cc: "pgsql-performance@lists.postgresql.org" <pgsql-performance@lists.postgresql.org>
Subject: Re: Huge shared hit for small table
On Mon, Nov 4, 2019 at 2:38 PM Scott Rankin <srankin@motus.com> wrote:
Hello all,
We are trying to debug some slow performance in our production environment (Amazon RDS, Postgresql 9.6.11), and we’re looking at a particular EXPLAIN node that seems… weird. This is a very large query involving a number of joins, but it performs pretty well in our staging environment (which has roughly the same data set as production, with a few tweaks). However, there is one node in the EXPLAIN plan that is wildly different:
Could there be a long-open transaction, which is preventing hint-bits from getting set on the table rows, as well as on the index rows?
...
The tables in both environments are about the same size (18MB) and the indexes are about the same size (360kb/410kb) – and the shared hits are pretty much the same on the other nodes of the query between the two environments.
If this table has more turn-over than those other tables (as measured in rows, not in percentage of the table), this would not be inconsistent with my theory.
This has happened one time before, and we did a “REINDEX” on the program table – and that made the problem mostly go away. Now it seems to be back, and I’m not sure what to make of it.
A reindex would not by itself fix the problem if it were the long open transaction. But if the long open transaction held a sufficient lock on the table, then the reindex would block until the transaction went away on its own, at which point the problem would go away on its own, so it might **appear** to have fixed the problem.
Cheers,
Jeff
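A session like the one Jeff describes can be spotted from pg_stat_activity; a sketch that lists backends holding a transaction open, oldest first (column names as of 9.6):

-- Backends with an open transaction, oldest first.
SELECT pid, usename, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;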
On Mon, Nov 4, 2019 at 12:38 PM Scott Rankin <srankin@motus.com> wrote:
> Definitely no long-running transactions on this table; in fact, this table is pretty infrequently updated – on the order of a few tens of rows updated per day.

But a long running transaction will have an impact on all tables -- not just the tables that happen to have been accessed so far in the long running transaction. This is necessary because nothing stops the long running transaction from SELECTing data from any table at any time -- we need to pessimistically keep around the data required to make that work.

--
Peter Geoghegan
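The horizon Peter describes is visible as backend_xmin in pg_stat_activity; a sketch for seeing which session is holding back cleanup for every table:

-- The oldest backend_xmin pins dead-row cleanup database-wide.
SELECT pid, usename, state, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;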
I think we have a winner. I looked in and found a process that was 'idle in transaction' for a couple of days – and once I killed it, query performance went back to normal. Thank you all for the very quick responses on this.

On 11/4/19, 3:41 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:

> But a long running transaction will have an impact on all tables -- not just the tables that happen to have been accessed so far in the long running transaction. This is necessary because nothing stops the long running transaction from SELECTing data from any table at any time -- we need to pessimistically keep around the data required to make that work.
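Terminating the offending backend would have looked something like this; the pid is a placeholder taken from pg_stat_activity, and terminating other users' sessions requires superuser (on RDS, the rds_superuser role):

-- 12345 is a placeholder pid from pg_stat_activity.
SELECT pg_terminate_backend(12345);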