Count backend self-sync calls

Поиск
Список
Период
Сортировка
От Greg Smith
Тема Count backend self-sync calls
Дата
Msg-id 4CE05DB9.9080309@2ndquadrant.com
обсуждение исходный текст
Ответы Re: Count backend self-sync calls  (Robert Haas <robertmhaas@gmail.com>)
Re: Count backend self-sync calls  (Jeff Janes <jeff.janes@gmail.com>)
Список pgsql-hackers
The attached patch adds a new field to pg_stat_bgwriter, counting the
number of times backends execute their own fsync calls.  Normally, when
a backend needs to fsync data, it passes a request to the background
writer, which then absorbs the call into its own queue of work to do.
However, under some types of heavy system load, the associated queue can
fill.  When this happens, backends are forced to do their own fsync
call.  This is potentially much worse than when they do a regular write.

The really nasty situation is when the background writer is busy because
it's executing a checkpoint.  In that case, it's possible for the
backend fsync calls to start competing with the ones the background
writer is trying to get done, causing the checkpoint sync phase to
execute slower than it should.  I've seen the sync phase take over 45
minutes on a really busy server once it got into this condition, where
hundreds of clients doing their own backend fsync calls were fighting
against the checkpoint fsync work.  With this patch, you can observe
that happening as an upwards spike in
pg_stat_bgwriter.buffers_backend_sync, which as documented is an
inclusive subset of the total shown in buffers_backend.

While it takes a busier system than I can useful show how to simulate
here to show a really bad situation, I'm able to see some of these
unabsorbed backend fsync calls when initializing a pgbench database, to
prove they happen in the lab.  The attached test program takes as its
input a pgbench scale counter.  It then creates a pgbench database
(deleting any existing pgbench database, so watch out for that) and
shows the values accumulated in pg_stat_bgwriter during that period.
Here's an example, using the script's default scale of 100 on a server
with 8GB of RAM and fake fsync (the hard drives are lying about it):

-[ RECORD 1 ]--------+-----------------------------
now                  | 2010-11-14 16:08:41.36421-05
...
Initializing pgbench
-[ RECORD 1 ]--------+------------------------------
now                  | 2010-11-14 16:09:46.713693-05
checkpoints_timed    | 0
checkpoints_req      | 0
buffers_checkpoint   | 0
buffers_clean        | 0
maxwritten_clean     | 0
buffers_backend      | 654716
buffers_backend_sync | 90
buffers_alloc        | 803

This is with default sizing for memory structures.  As you increase
shared_buffers, one of the queues involved here increases
proportionately, making it less likely to run into this problem.  That
just changes it to the kind of problem I've only seen on a larger system
with a difficult to simulate workload.  The production system getting
hammered with this problem (running a web application) that prompted
writing the patch had shared_buffers=4GB at the time.

The patch also adds some logging to the innards involved here, to help
with understanding problems in this area.  I don't think that should be
in the version committed as is.  May want to drop the logging level or
make it disabled in regular builds, since it is sitting somewhere it
generates a lot of log data and adds overhead.  It is nice for now, as
it lets you get an idea how much fsync work *is* being absorbed by the
BGW, as well as showing what relation is suffering from this issue.
Example of both those things, with the default config for everything
except log_checkpoints (on) and log_min_messages (debug1):

DEBUG:  Absorbing 4096 fsync requests
DEBUG:  Absorbing 150 fsync requests
DEBUG:  Unable to forward fsync request, executing directly
CONTEXT:  writing block 158638 of relation base/16385/16398

Here 4096 is the most entries the BGW will ever absorb at once, and all
90 of the missed sync calls are logged so you can see what files they
came from.

As a high-level commentary about this patch, I'm not sure what most end
users will ever do with this data.  At the same time, I wasn't sure what
a typical end user would do with anything else in pg_stat_bgwriter
either when it was added, and it turns out the answer is "wait for
people who know the internals to write things that monitor it".  For
example, Magnus has updated recent versions of the Munin plug-in for
PostgreSQL to usefully graph pg_stat_bgwriter data over time.  As I have
some data to suggest checkpoint problems on Linux in particular are
getting worse as total system memory increases, I expect that having a
way to easily instrument for this particular problem will be
correspondingly more important in the future too.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dbf966b..f701b8f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -264,8 +264,10 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
       and the number of times the background writer stopped a cleaning scan
       because it had written too many buffers.  Also includes
       statistics about the shared buffer pool, including buffers written
-      by backends (that is, not by the background writer) and total buffers
-      allocated.
+      by backends (that is, not by the background writer), how many times
+      those backends had to execute their own filesystem synchronize calls
+      (normally the background writer handles those even when the backend
+      writes itself), and total buffers allocated.
      </entry>
      </row>

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 09574c3..8766ccc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -537,6 +537,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
         pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
         pg_stat_get_buf_written_backend() AS buffers_backend,
+        pg_stat_get_buf_sync_backend() AS buffers_backend_sync,
         pg_stat_get_buf_alloc() AS buffers_alloc;

 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0690ab5..43a149e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -102,13 +102,15 @@
  *
  * num_backend_writes is used to count the number of buffer writes performed
  * by non-bgwriter processes.  This counter should be wide enough that it
- * can't overflow during a single bgwriter cycle.
+ * can't overflow during a single bgwriter cycle.  num_backend_fsync
+ * counts the subset of those writes that also had to do their own fsync,
+ * because the background writer failed to absorb their request.
  *
  * The requests array holds fsync requests sent by backends and not yet
  * absorbed by the bgwriter.
  *
- * Unlike the checkpoint fields, num_backend_writes and the requests
- * fields are protected by BgWriterCommLock.
+ * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
+ * the requests fields are protected by BgWriterCommLock.
  *----------
  */
 typedef struct
@@ -132,6 +134,7 @@ typedef struct
     int            ckpt_flags;        /* checkpoint flags, as defined in xlog.h */

     uint32        num_backend_writes;        /* counts non-bgwriter buffer writes */
+    uint32        num_backend_fsync;        /* counts non-bgwriter fsync calls */

     int            num_requests;    /* current # of requests */
     int            max_requests;    /* allocated array size */
@@ -1084,12 +1087,14 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,

     LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);

-    /* we count non-bgwriter writes even when the request queue overflows */
+    /* Count all backend writes regardless of if they fit in the queue */
     BgWriterShmem->num_backend_writes++;

     if (BgWriterShmem->bgwriter_pid == 0 ||
         BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
     {
+        /* Also count the subset where backends have to do their own fsync */
+        BgWriterShmem->num_backend_fsync++;
         LWLockRelease(BgWriterCommLock);
         return false;
     }
@@ -1137,7 +1142,10 @@ AbsorbFsyncRequests(void)

     /* Transfer write count into pending pgstats message */
     BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
+    BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
+
     BgWriterShmem->num_backend_writes = 0;
+    BgWriterShmem->num_backend_fsync = 0;

     n = BgWriterShmem->num_requests;
     if (n > 0)
@@ -1149,6 +1157,9 @@ AbsorbFsyncRequests(void)

     LWLockRelease(BgWriterCommLock);

+    if (n > 0)
+        elog(DEBUG1,"Absorbing %d fsync requests",n);
+
     for (request = requests; n > 0; request++, n--)
         RememberFsyncRequest(request->rnode, request->forknum, request->segno);

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a1ad342..c3c136a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4188,6 +4188,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
     globalStats.buf_written_clean += msg->m_buf_written_clean;
     globalStats.maxwritten_clean += msg->m_maxwritten_clean;
     globalStats.buf_written_backend += msg->m_buf_written_backend;
+    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
     globalStats.buf_alloc += msg->m_buf_alloc;
 }

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c72da72..7140b94 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1269,6 +1269,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
         if (ForwardFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno))
             return;                /* passed it off successfully */

+        elog(DEBUG1, "Unable to forward fsync request, executing directly");
+
         if (FileSync(seg->mdfd_vfd) < 0)
             ereport(ERROR,
                     (errcode_for_file_access(),
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6edb8bf..2104b53 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -78,6 +78,7 @@ extern Datum pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_sync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);

 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
@@ -1165,6 +1166,12 @@ pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 }

 Datum
+pg_stat_get_buf_sync_backend(PG_FUNCTION_ARGS)
+{
+    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+}
+
+Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
     PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 4f444ae..7fc6f05 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3117,6 +3117,8 @@ DATA(insert OID = 2773 ( pg_stat_get_bgwriter_maxwritten_clean PGNSP PGUID 12 1
 DESCR("statistics: number of times the bgwriter stopped processing when it had written too many buffers while
cleaning");
 DATA(insert OID = 2775 ( pg_stat_get_buf_written_backend PGNSP PGUID 12 1 0 0 f f f t f s 0 0 20 "" _null_ _null_
_null__null_ pg_stat_get_buf_written_backend _null_ _null_ _null_ )); 
 DESCR("statistics: number of buffers written by backends");
+DATA(insert OID = 3063 ( pg_stat_get_buf_sync_backend PGNSP PGUID 12 1 0 0 f f f t f s 0 0 20 "" _null_ _null_ _null_
_null_pg_stat_get_buf_sync_backend _null_ _null_ _null_ )); 
+DESCR("statistics: number of backend buffer writtes that did their own sync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc            PGNSP PGUID 12 1 0 0 f f f t f s 0 0 20 "" _null_ _null_
_null__null_ pg_stat_get_buf_alloc _null_ _null_ _null_ )); 
 DESCR("statistics: number of buffer allocations");

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d49bd80..42bf9c4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -360,6 +360,7 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_buf_written_clean;
     PgStat_Counter m_maxwritten_clean;
     PgStat_Counter m_buf_written_backend;
+    PgStat_Counter m_buf_fsync_backend;
     PgStat_Counter m_buf_alloc;
 } PgStat_MsgBgWriter;

@@ -562,6 +563,7 @@ typedef struct PgStat_GlobalStats
     PgStat_Counter buf_written_clean;
     PgStat_Counter maxwritten_clean;
     PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
     PgStat_Counter buf_alloc;
 } PgStat_GlobalStats;

diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c7a796a..32cda64 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1293,7 +1293,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
  pg_stat_activity            | SELECT s.datid, d.datname, s.procpid, s.usesysid, u.rolname AS usename,
s.application_name,s.client_addr, s.client_port, s.backend_start, s.xact_start, s.query_start, s.waiting,
s.current_queryFROM pg_database d, pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name,
current_query,waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u WHERE ((s.datid =
d.oid)AND (s.usesysid = u.oid)); 
  pg_stat_all_indexes         | SELECT c.oid AS relid, i.oid AS indexrelid, n.nspname AS schemaname, c.relname,
i.relnameAS indexrelname, pg_stat_get_numscans(i.oid) AS idx_scan, pg_stat_get_tuples_returned(i.oid) AS idx_tup_read,
pg_stat_get_tuples_fetched(i.oid)AS idx_tup_fetch FROM (((pg_class c JOIN pg_index x ON ((c.oid = x.indrelid))) JOIN
pg_classi ON ((i.oid = x.indexrelid))) LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace))) WHERE (c.relkind = ANY
(ARRAY['r'::"char",'t'::"char"])); 
  pg_stat_all_tables          | SELECT c.oid AS relid, n.nspname AS schemaname, c.relname, pg_stat_get_numscans(c.oid)
ASseq_scan, pg_stat_get_tuples_returned(c.oid) AS seq_tup_read, (sum(pg_stat_get_numscans(i.indexrelid)))::bigint AS
idx_scan,((sum(pg_stat_get_tuples_fetched(i.indexrelid)))::bigint + pg_stat_get_tuples_fetched(c.oid)) AS
idx_tup_fetch,pg_stat_get_tuples_inserted(c.oid) AS n_tup_ins, pg_stat_get_tuples_updated(c.oid) AS n_tup_upd,
pg_stat_get_tuples_deleted(c.oid)AS n_tup_del, pg_stat_get_tuples_hot_updated(c.oid) AS n_tup_hot_upd,
pg_stat_get_live_tuples(c.oid)AS n_live_tup, pg_stat_get_dead_tuples(c.oid) AS n_dead_tup,
pg_stat_get_last_vacuum_time(c.oid)AS last_vacuum, pg_stat_get_last_autovacuum_time(c.oid) AS last_autovacuum,
pg_stat_get_last_analyze_time(c.oid)AS last_analyze, pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze,
pg_stat_get_vacuum_count(c.oid)AS vacuum_count, pg_stat_get_autovacuum_count(c.oid) AS autovacuum_count,
pg_stat_get_analyze_count(c.oid)AS analyze_count, pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count FROM
((pg_classc LEFT JOIN pg_index i ON ((c.oid = i.indrelid))) LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
WHERE(c.relkind = ANY (ARRAY['r'::"char", 't'::"char"])) GROUP BY c.oid, n.nspname, c.relname; 
- pg_stat_bgwriter            | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints()AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS
buffers_checkpoint,pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean()
ASmaxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_alloc() AS buffers_alloc; 
+ pg_stat_bgwriter            | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
pg_stat_get_bgwriter_requested_checkpoints()AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS
buffers_checkpoint,pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean()
ASmaxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_sync_backend() AS
buffers_backend_sync,pg_stat_get_buf_alloc() AS buffers_alloc; 
  pg_stat_database            | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends,
pg_stat_get_db_xact_commit(d.oid)AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback,
(pg_stat_get_db_blocks_fetched(d.oid)- pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid)
ASblks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched,
pg_stat_get_db_tuples_inserted(d.oid)AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated,
pg_stat_get_db_tuples_deleted(d.oid)AS tup_deleted FROM pg_database d; 
  pg_stat_sys_indexes         | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid,
pg_stat_all_indexes.schemaname,pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname,
pg_stat_all_indexes.idx_scan,pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM
pg_stat_all_indexesWHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name,
'information_schema'::name]))OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text)); 
  pg_stat_sys_tables          | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname,
pg_stat_all_tables.relname,pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan,
pg_stat_all_tables.idx_tup_fetch,pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd,
pg_stat_all_tables.n_tup_del,pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count,
pg_stat_all_tables.autovacuum_count,pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM
pg_stat_all_tablesWHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name]))
OR(pg_stat_all_tables.schemaname ~ '^pg_toast'::text)); 

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Yeb Havinga
Дата:
Сообщение: Re: wCTE behaviour
Следующее
От: Peter Eisentraut
Дата:
Сообщение: Re: wCTE behaviour